API Reference
tnh_scholar
TNH Scholar: Text Processing and Analysis Tools
TNH Scholar is an AI-driven project designed to explore, query, process and translate
the teachings of Thich Nhat Hanh and other Plum Village Dharma Teachers. The project
aims to create a resource for practitioners and scholars to deeply engage with
mindfulness and spiritual wisdom through natural language processing and machine
learning models.
Core Features
- Audio transcription and processing
- Multi-lingual text processing and translation
- Pattern-based text analysis
- OCR processing for historical documents
- CLI tools for batch processing
Package Structure
- tnh_scholar/
    - CLI_tools/ - Command line interface tools
    - audio_processing/ - Audio file handling and transcription
    - journal_processing/ - Journal and publication processing
    - ocr_processing/ - Optical character recognition tools
    - text_processing/ - Core text processing utilities
    - video_processing/ - Video file handling and transcription
    - utils/ - Shared utility functions
    - xml_processing/ - XML parsing and generation
Environment Configuration
The package uses environment variables for configuration, including:
- TNH_PATTERN_DIR - Directory for text processing patterns
- OPENAI_API_KEY - OpenAI API authentication
- GOOGLE_VISION_KEY - Google Cloud Vision API key for OCR
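These variables can also be read programmatically; a minimal sketch using only the standard library (the fallback pattern directory below is illustrative, not the package's actual default):

```python
import os
from pathlib import Path

# Read TNH Scholar configuration from the environment.
# The fallback pattern directory here is illustrative only.
pattern_dir = Path(
    os.environ.get(
        "TNH_PATTERN_DIR",
        str(Path.home() / ".config" / "tnh-scholar" / "patterns"),
    )
)
openai_key = os.environ.get("OPENAI_API_KEY")  # None if not configured

if openai_key is None:
    print("OPENAI_API_KEY is not set; OpenAI-backed features will be unavailable.")
```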
CLI Tools
- audio-transcribe - Audio file transcription utility
- tnh-fab - Text processing and analysis toolkit
For more information, see:
- Documentation: https://aaronksolomon.github.io/tnh-scholar/
- Source: https://github.com/aaronksolomon/tnh-scholar
- Issues: https://github.com/aaronksolomon/tnh-scholar/issues
Dependencies
- Core: click, pydantic, openai, yt-dlp
- Optional: streamlit (GUI), spacy (NLP), google-cloud-vision (OCR)
TNH_CLI_TOOLS_DIR = TNH_ROOT_SRC_DIR / 'cli_tools'
module-attribute
TNH_CONFIG_DIR = Path.home() / '.config' / 'tnh-scholar'
module-attribute
TNH_DEFAULT_PATTERN_DIR = TNH_PROJECT_ROOT_DIR / 'patterns'
module-attribute
TNH_LOG_DIR = TNH_CONFIG_DIR / 'logs'
module-attribute
TNH_METADATA_PROCESS_FIELD = 'tnh_processing'
module-attribute
TNH_PROJECT_ROOT_DIR = TNH_ROOT_SRC_DIR.resolve().parent.parent
module-attribute
TNH_ROOT_SRC_DIR = Path(__file__).resolve().parent
module-attribute
__version__ = '0.1.3'
module-attribute
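The directory attributes above are plain `pathlib` arithmetic over the package's source location; a sketch of the same derivation using a stand-in for `Path(__file__)` (the `/repo` prefix is hypothetical):

```python
from pathlib import Path

# Stand-in for Path(__file__).resolve().parent inside the installed package.
TNH_ROOT_SRC_DIR = Path("/repo/src/tnh_scholar")

TNH_PROJECT_ROOT_DIR = TNH_ROOT_SRC_DIR.resolve().parent.parent  # -> /repo
TNH_CLI_TOOLS_DIR = TNH_ROOT_SRC_DIR / "cli_tools"
TNH_DEFAULT_PATTERN_DIR = TNH_PROJECT_ROOT_DIR / "patterns"
TNH_CONFIG_DIR = Path.home() / ".config" / "tnh-scholar"
TNH_LOG_DIR = TNH_CONFIG_DIR / "logs"
```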
ai_text_processing
Public surface for tnh_scholar.ai_text_processing.
Historically this module eagerly imported multiple submodules with heavy
dependencies (e.g., audio codecs, ML toolkits) which made importing lightweight
components such as Prompt surprisingly expensive and brittle in test
environments. We now lazily import the concrete implementations on demand so
that callers can depend on just the pieces they need.
__all__ = ['OpenAIProcessor', 'SectionParser', 'SectionProcessor', 'find_sections', 'process_text', 'process_text_by_paragraphs', 'process_text_by_sections', 'get_pattern', 'translate_text_by_lines', 'openai_process_text', 'GitBackedRepository', 'LocalPromptManager', 'Prompt', 'PromptCatalog', 'AIResponse', 'LogicalSection', 'SectionEntry', 'TextObject', 'TextObjectInfo']
module-attribute
AIResponse
Bases: BaseModel
Class for dividing large texts into AI-processable segments while maintaining broader document context.
Source code in src/tnh_scholar/ai_text_processing/text_object.py, lines 97-117
document_metadata = Field(..., description='Available Dublin Core standard metadata in human-readable YAML format')
class-attribute
instance-attribute
document_summary = Field(..., description="Concise, comprehensive overview of the text's content and purpose")
class-attribute
instance-attribute
key_concepts = Field(..., description='Important terms, ideas, or references that appear throughout the text')
class-attribute
instance-attribute
language = Field(..., description='ISO 639-1 language code')
class-attribute
instance-attribute
narrative_context = Field(..., description='Concise overview of how the text develops or progresses as a whole')
class-attribute
instance-attribute
sections
instance-attribute
GitBackedRepository
Manages versioned storage of prompts using Git.
Provides basic Git operations while hiding complexity:
- Automatic versioning of changes
- Basic conflict resolution
- History tracking
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 337-562
repo = Repo(repo_path)
instance-attribute
repo_path = repo_path
instance-attribute
__init__(repo_path)
Initialize or connect to Git repository.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `repo_path` | `Path` | Path to repository directory | *required* |

Raises:

| Type | Description |
|---|---|
| `GitCommandError` | If Git operations fail |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 347-375
display_history(file_path, max_versions=0)
Display history of changes for a file with diffs between versions.
Shows most recent changes first, limited to max_versions entries. For each change shows:
- Commit info and date
- Stats summary of changes
- Detailed color diff with 2 lines of context
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to file in repository | *required* |
| `max_versions` | `int` | Maximum number of versions to show | `0` |
Example

    repo.display_history(Path("prompts/format_dharma_talk.yaml"))
    Commit abc123def (2024-12-28 14:30:22):
    1 file changed, 5 insertions(+), 2 deletions(-)
    diff --git a/prompts/format_dharma_talk.yaml ...
    ...
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 487-547
update_file(file_path)
Stage and commit changes to a file in the Git repository.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Absolute or relative path to the file. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | Commit hash if changes were made. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the file does not exist. |
| `ValueError` | If the file is outside the repository. |
| `GitCommandError` | If Git operations fail. |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 377-409
LocalPromptManager
A simple singleton implementation of PromptManager that ensures only one instance is created and reused throughout the application lifecycle.
This class wraps the PromptManager to provide efficient prompt loading by maintaining a single reusable instance.
Attributes:

| Name | Type | Description |
|---|---|---|
| `_instance` | `Optional[LocalPromptManager]` | The singleton instance |
| `_prompt_manager` | `Optional[PromptManager]` | The wrapped PromptManager instance |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 931-987
prompt_manager
property
Lazy initialization of the PromptManager instance.
Returns:

| Name | Type | Description |
|---|---|---|
| `PromptManager` | `PromptCatalog` | The wrapped PromptManager instance |

Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If PATTERN_REPO is not properly configured |
__new__()
Create or return the singleton instance.
Returns:

| Name | Type | Description |
|---|---|---|
| `LocalPromptManager` | `LocalPromptManager` | The singleton instance |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 946-956
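The `__new__`-based singleton described above follows the standard Python pattern; a minimal stand-alone sketch (class and attribute names here are illustrative, not the actual implementation):

```python
class SingletonManager:
    """One shared instance; repeated construction returns the same object."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._manager = None  # wrapped manager, built lazily
        return cls._instance

    @property
    def manager(self):
        # Lazy initialization: the wrapped object is created on first access.
        if self._manager is None:
            self._manager = object()  # stand-in for the wrapped PromptManager
        return self._manager

a, b = SingletonManager(), SingletonManager()
assert a is b                  # a single instance application-wide
assert a.manager is b.manager  # the lazily built manager is shared too
```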
get_prompt(name)
Get a prompt by name.
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 985-987
LogicalSection
Bases: BaseModel
Represents a contextually meaningful segment of a larger text.
Sections should preserve natural breaks in content (explicit section markers, topic shifts, argument development, narrative progression) while staying within specified size limits in order to create chunks suitable for AI processing.
Source code in src/tnh_scholar/ai_text_processing/text_object.py, lines 79-94
start_line = Field(..., description='Starting line number that begins this logical segment')
class-attribute
instance-attribute
title = Field(..., description="Descriptive title of section's key content")
class-attribute
instance-attribute
OpenAIProcessor
Bases: TextProcessor
OpenAI-based text processor implementation.
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py, lines 76-104
max_tokens = max_tokens
instance-attribute
model = model
instance-attribute
__init__(model=None, max_tokens=0)
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py, lines 78-82
process_text(input_str, instructions, response_format=None, max_tokens=0, **kwargs)
Process text using OpenAI API with optional structured output.
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py, lines 84-104
Prompt
Base Prompt class for version-controlled template prompts.
Prompts contain:
- Instructions: The main prompt instructions as a Jinja2 template. Note: Instructions are intended to be saved in markdown format in a .md file.
- Template fields: Default values for template variables
- Metadata: Name and identifier information
Version control is handled externally through Git, not in the prompt itself. Prompt identity is determined by the combination of identifiers.
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the prompt |
| `instructions` | `str` | The Jinja2 template string for this prompt |
| `default_template_fields` | `Dict[str, str]` | Default values for template variables |
| `_allow_empty_vars` | `bool` | Whether to allow undefined template variables |
| `_env` | `Environment` | Configured Jinja2 environment instance |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 28-334
default_template_fields = default_template_fields or {}
instance-attribute
instructions = instructions
instance-attribute
name = name
instance-attribute
path = path
instance-attribute
__eq__(other)
Compare prompts based on their content.
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 326-330
__hash__()
Hash based on content hash for container operations.
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 332-334
__init__(name, instructions, path=None, default_template_fields=None, allow_empty_vars=False)
Initialize a new Prompt instance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Unique name identifying the prompt | *required* |
| `instructions` | `MarkdownStr` | Jinja2 template string containing the prompt | *required* |
| `default_template_fields` | `Optional[Dict[str, str]]` | Optional default values for template variables | `None` |
| `allow_empty_vars` | `bool` | Whether to allow undefined template variables | `False` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If name or instructions are empty |
| `TemplateError` | If template syntax is invalid |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 58-93
apply_template(field_values=None)
Apply template values to prompt instructions using Jinja2.
Values precedence (highest to lowest):
1. field_values (explicitly passed)
2. frontmatter values (from prompt file)
3. default_template_fields (prompt defaults)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `field_values` | `Optional[Dict[str, str]]` | Values to substitute into the template. If None, uses frontmatter/defaults. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | Rendered instructions with template values applied. |

Raises:

| Type | Description |
|---|---|
| `TemplateError` | If template rendering fails |
| `ValueError` | If required template variables are missing |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 125-160
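The precedence order above maps directly onto `collections.ChainMap`, where earlier mappings win; a sketch with hypothetical template fields:

```python
from collections import ChainMap

field_values = {"audience": "practitioners"}               # explicitly passed
frontmatter = {"audience": "scholars", "language": "en"}   # from the prompt file
defaults = {"audience": "general", "language": "vi", "style": "plain"}

# ChainMap searches left to right, so field_values shadows frontmatter,
# which in turn shadows the prompt defaults.
merged = ChainMap(field_values, frontmatter, defaults)
assert merged["audience"] == "practitioners"
assert merged["language"] == "en"
assert merged["style"] == "plain"
```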
content_hash()
Generate a SHA-256 hash of the prompt content.
Useful for quick content comparison and change detection.
Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | Hexadecimal string of the SHA-256 hash |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 273-286
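Such a hash is a direct application of `hashlib`; a sketch of hashing a prompt's name and instructions together (the exact fields the real method includes are not shown here):

```python
import hashlib

def content_hash(name: str, instructions: str) -> str:
    """Return a SHA-256 hex digest over the prompt content."""
    digest = hashlib.sha256()
    digest.update(name.encode("utf-8"))
    digest.update(instructions.encode("utf-8"))
    return digest.hexdigest()

h = content_hash("format_dharma_talk", "Format the talk as {{ style }}.")
assert len(h) == 64  # SHA-256 hex digest length
assert h == content_hash("format_dharma_talk", "Format the talk as {{ style }}.")
```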
extract_frontmatter()
Extract and validate YAML frontmatter from markdown instructions.
Returns:

| Type | Description |
|---|---|
| `Optional[Dict[str, Any]]` | Frontmatter data if found and valid, None otherwise |
Note
Frontmatter must be at the very start of the file and properly formatted.
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 196-221
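The start-of-file requirement can be enforced with a simple delimiter check; a stdlib-only sketch that returns the raw frontmatter block as a string (the real method additionally parses it as YAML, which is omitted here):

```python
def split_frontmatter(text: str):
    """Return (frontmatter_str or None, body) for '---'-delimited frontmatter.

    Frontmatter is recognized only at the very start of the text.
    """
    if not text.startswith("---\n"):
        return None, text
    end = text.find("\n---", 4)  # locate the closing delimiter
    if end == -1:
        return None, text
    frontmatter = text[4:end]
    body = text[end + 4:].lstrip("\n")
    return frontmatter, body

fm, body = split_frontmatter("---\ntitle: Example\n---\nBody text.")
assert fm == "title: Example"
assert body == "Body text."
```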
from_dict(data)
classmethod
Create prompt instance from dictionary data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `Dict[str, Any]` | Dictionary containing prompt data | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `Prompt` | `Prompt` | New prompt instance |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If required fields are missing |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 301-324
get_content_without_frontmatter()
Get markdown content with frontmatter removed.
Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | Markdown content without frontmatter |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 223-231
source_bytes()
Best-effort raw bytes for prompt hashing.

Hashing prefers the exact on-disk bytes, including front matter, so we first try to read from prompt_path. If that fails, we fall back to hashing the concatenation of known templates. In V1, only the instructions (system template) are used for rendering.
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 256-271
to_dict()
Convert prompt to dictionary for serialization.
Returns:

| Type | Description |
|---|---|
| `Dict[str, Any]` | Dict containing all prompt data in serializable format |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 288-299
update_frontmatter(new_data)
Update or add frontmatter to the markdown content.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `new_data` | `Dict[str, Any]` | Dictionary of frontmatter fields to update | *required* |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 233-253
PromptCatalog
Main interface for prompt management system.
Provides high-level operations:
- Prompt creation and loading
- Automatic versioning
- Safe concurrent access
- Basic history tracking
- Case-insensitive prompt names (stored as lowercase)
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 695-929
access_manager = ConcurrentAccessManager(self.base_path / '.locks')
instance-attribute
base_path = Path(base_path).resolve()
instance-attribute
repo = GitBackedRepository(self.base_path)
instance-attribute
__init__(base_path)
Initialize prompt management system.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `base_path` | `Path` | Base directory for prompt storage | *required* |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 707-721
get_path(prompt_name)
Recursively search for a prompt file with the given name (case-insensitive) in base_path and all subdirectories.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompt_name` | `str` | Prompt name (without extension) to search for | *required* |

Returns:

| Type | Description |
|---|---|
| `Optional[Path]` | Full path to the found prompt file, or None if not found |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 760-780
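The recursive, case-insensitive lookup can be sketched with `Path.rglob`; the `.md` extension matches the prompt-file convention described above, and the directory layout below is illustrative:

```python
import tempfile
from pathlib import Path

def find_prompt(base_path: Path, prompt_name: str):
    """Return the first .md file whose stem matches prompt_name, ignoring case."""
    target = prompt_name.lower()
    for candidate in sorted(base_path.rglob("*.md")):
        if candidate.stem.lower() == target:
            return candidate
    return None

# Illustrative layout: <base>/subdir/Format_Talk.md
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "subdir").mkdir()
    (base / "subdir" / "Format_Talk.md").write_text("instructions")
    found = find_prompt(base, "format_talk")   # case-insensitive hit
    missing = find_prompt(base, "not_there")   # no match -> None
```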
load(prompt_name)
Load the .md prompt file by name, extract placeholders, and return a fully constructed Prompt object.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompt_name` | `str` | Name of the prompt (without .md extension). | *required* |

Returns:

| Type | Description |
|---|---|
| `Prompt` | A new Prompt object whose `instructions` is the file's text and whose `template_fields` are inferred from placeholders in those instructions. |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 817-850
save(prompt, subdir=None)
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 782-815
show_history(prompt_name)
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 852-857
verify_repository(base_path)
classmethod
Verify repository integrity and uniqueness of prompt names.
Performs the following checks:
1. Validates Git repository structure.
2. Ensures no duplicate prompt names exist.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `base_path` | `Path` | Repository path to verify. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | True if the repository is valid and contains no duplicate prompt files. |
Source code in src/tnh_scholar/ai_text_processing/prompts.py, lines 873-929
SectionEntry
Bases: NamedTuple
Represents a section with its content during iteration.
Source code in src/tnh_scholar/ai_text_processing/text_object.py, lines 72-77
content
instance-attribute
number
instance-attribute
range
instance-attribute
title
instance-attribute
SectionParser
Generates structured section breakdowns of text content.
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py, lines 129-244
review_count = review_count
instance-attribute
section_pattern = section_pattern
instance-attribute
section_scanner = section_scanner
instance-attribute
__init__(section_scanner, section_pattern, review_count=DEFAULT_REVIEW_COUNT)
Initialize section generator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `processor` | | Implementation of TextProcessor | *required* |
| `pattern` | | Pattern object containing section generation instructions | *required* |
| `max_tokens` | | Maximum tokens for response | *required* |
| `section_count` | | Target number of sections | *required* |
| `review_count` | `int` | Number of review passes | `DEFAULT_REVIEW_COUNT` |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py, lines 132-150
find_sections(text, section_count_target=None, segment_size_target=None, template_dict=None)
Generate section breakdown of input text. The text must be split up by newlines.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `TextObject` | Input TextObject to process | *required* |
| `section_count_target` | `Optional[int]` | Target number of sections to find | `None` |
| `segment_size_target` | `Optional[int]` | Target number of lines per section (if section_count_target is specified, this value will be set to generate correct segments) | `None` |
| `template_dict` | `Optional[Dict[str, str]]` | Optional additional template variables | `None` |

Returns:

| Type | Description |
|---|---|
| `TextObject` | TextObject containing section breakdown |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py, lines 152-229
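The relationship between `section_count_target` and `segment_size_target` reduces to simple line arithmetic; a sketch of deriving a per-section line budget from a target section count (pure illustration of the bookkeeping, not the model-driven sectioning itself):

```python
import math

def segment_size_for(total_lines: int, section_count_target: int) -> int:
    """Lines per section needed to yield roughly the requested section count."""
    return max(1, math.ceil(total_lines / section_count_target))

def split_by_size(lines, segment_size):
    """Chunk newline-split text into fixed-size line ranges (1-based, inclusive)."""
    return [
        (start + 1, min(start + segment_size, len(lines)))
        for start in range(0, len(lines), segment_size)
    ]

lines = [f"line {i}" for i in range(1, 11)]  # 10 lines of text
size = segment_size_for(len(lines), section_count_target=3)
assert size == 4
assert split_by_size(lines, size) == [(1, 4), (5, 8), (9, 10)]
```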
SectionProcessor
Handles section-based XML text processing with configurable output handling.
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py, lines 302-418
pattern = pattern
instance-attribute
processor = processor
instance-attribute
template_dict = template_dict
instance-attribute
wrap_in_document = wrap_in_document
instance-attribute
__init__(processor, pattern, template_dict, wrap_in_document=True)
Initialize the XML section processor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `processor` | `TextProcessor` | Implementation of TextProcessor to use | *required* |
| `pattern` | `Prompt` | Pattern object containing processing instructions | *required* |
| `template_dict` | `Dict` | Dictionary for template substitution | *required* |
| `wrap_in_document` | `bool` | Whether to wrap output in | `True` |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py, lines 305-324
process_paragraphs(text)
Process transcript by paragraphs (as sections), yielding ProcessedSection objects. Paragraphs are assumed to be newline-separated.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `TextObject` | TextObject to process | *required* |

Yields:

| Name | Type | Description |
|---|---|---|
| `ProcessedSection` | `ProcessedSection` | One processed paragraph at a time, containing: title (paragraph number, e.g., 'Paragraph 1'), original_str (raw paragraph text), processed_str (processed paragraph text), metadata (optional metadata dict) |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py, lines 380-418
process_sections(text_object)
Process transcript sections and yield results one section at a time.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `transcript` | | Text to process | *required* |
| `text_object` | `TextObject` | Object containing section definitions | *required* |

Yields:

| Name | Type | Description |
|---|---|---|
| `ProcessedSection` | `ProcessedSection` | One processed section at a time, containing: title (section title, English or original language), original_text (raw text segment), processed_text (processed text content), start_line (starting line number) |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py, lines 326-378
TextObject
Manages text content with section organization and metadata tracking.
TextObject serves as the core container for text processing, providing:
- Line-numbered text content management
- Language identification
- Section organization and access
- Metadata tracking including incorporated processing stages
Section boundaries are defined through line numbering: each section is defined by its start line without an explicit end line, and implicitly ends where the next section begins. SectionObjects are used to represent sections.
Attributes:

| Name | Type | Description |
|---|---|---|
| `num_text` | `NumberedText` | Line-numbered text content manager |
| `language` | `str` | ISO 639-1 language code for the text content |
| `_sections` | `List[SectionObject]` | Internal list of text sections with boundaries |
| `_metadata` | `Metadata` | Processing and content metadata container |
Example

    content = NumberedText("Line 1\nLine 2\nLine 3")
    obj = TextObject(content, language="en")
Source code in src/tnh_scholar/ai_text_processing/text_object.py, lines 155-612
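Because sections carry only start lines, each section's end is the line before the next section's start, and the last section runs to the final line; a sketch of that boundary computation with hypothetical section data:

```python
def section_ranges(start_lines, last_line_num):
    """Turn section start lines into inclusive (start, end) line ranges."""
    ranges = []
    for i, start in enumerate(start_lines):
        # A section ends just before the next section starts;
        # the last section extends to the final line of the text.
        end = start_lines[i + 1] - 1 if i + 1 < len(start_lines) else last_line_num
        ranges.append((start, end))
    return ranges

# Hypothetical sections starting at lines 1, 12, and 30 of a 45-line text.
assert section_ranges([1, 12, 30], last_line_num=45) == [(1, 11), (12, 29), (30, 45)]
```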
content
property
language = language or get_language_code_from_text(num_text.content)
instance-attribute
last_line_num
property
metadata
property
Access to metadata dictionary.
metadata_str
property
num_text = num_text
instance-attribute
numbered_content
property
section_count
property
sections
property
Access to sections list.
__init__(num_text, language=None, sections=None, metadata=None)
Initialize a TextObject with content and optional organizing components.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `num_text` | `NumberedText` | Text content with line numbering | *required* |
| `language` | `Optional[str]` | ISO 639-1 language code. If None, auto-detected from content | `None` |
| `sections` | `Optional[List[SectionObject]]` | Initial sections defining text organization. If None, text is considered un-sectioned. | `None` |
| `metadata` | `Optional[Metadata]` | Initial metadata. If None, creates empty metadata container | `None` |
Note
Until sections are established, section-based methods will raise a ValueError if called.
Source code in src/tnh_scholar/ai_text_processing/text_object.py, lines 185-210
__iter__()
Iterate through sections, yielding full section information.
Source code in src/tnh_scholar/ai_text_processing/text_object.py, lines 213-228
__str__()
Source code in src/tnh_scholar/ai_text_processing/text_object.py
export_info(source_file=None)
Export serializable state.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
from_info(info, metadata, num_text)
classmethod
Create TextObject from info and content.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
from_response(response, existing_metadata, num_text)
classmethod
Create TextObject from AI response format.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
from_section_file(section_file, source=None)
classmethod
Create TextObject from a section info file, loading content from source_file. Metadata is extracted from the source_file or from content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `section_file` | `Path` | Path to JSON file containing TextObjectInfo | *required* |
| `source` | `Optional[str]` | Optional source string in case no source file is found. | `None` |

Returns:

| Type | Description |
|---|---|
| `TextObject` | TextObject instance |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If source_file is missing from section info |
| `FileNotFoundError` | If either section_file or source_file not found |
Source code in src/tnh_scholar/ai_text_processing/text_object.py
from_str(text, language=None, sections=None, metadata=None)
classmethod
Create a TextObject from a string, extracting any frontmatter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text string, potentially containing frontmatter | *required* |
| `language` | `Optional[str]` | ISO language code | `None` |
| `sections` | `Optional[List[SectionObject]]` | List of section objects | `None` |
| `metadata` | `Optional[Metadata]` | Optional base metadata to merge with frontmatter | `None` |

Returns:

| Type | Description |
|---|---|
| `TextObject` | TextObject instance with combined metadata |
Source code in src/tnh_scholar/ai_text_processing/text_object.py
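The frontmatter extraction described above can be pictured with a self-contained sketch. The helper below is hypothetical and illustrative only, not the package's parser: leading `---`-delimited metadata is split from the body before the object is built.

```python
import re

def split_frontmatter(text: str):
    """Split leading '---'-delimited frontmatter from the body text."""
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.DOTALL)
    if not match:
        return {}, text  # no frontmatter: empty metadata, unchanged body
    metadata = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata, match.group(2)

meta, body = split_frontmatter("---\ntitle: Talk\nlanguage: vi\n---\nLine one.\n")
```

In the real API, the extracted mapping would then be merged with any `metadata` argument passed to `from_str`.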
from_text_file(file)
classmethod
Source code in src/tnh_scholar/ai_text_processing/text_object.py
get_section_content(index)
Source code in src/tnh_scholar/ai_text_processing/text_object.py
load(path, config=None)
classmethod
Load TextObject from file with optional configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Input file path | *required* |
| `config` | `Optional[LoadConfig]` | Optional loading configuration. If not provided, loads directly from text file. | `None` |

Returns:

| Type | Description |
|---|---|
| `TextObject` | TextObject instance |
Usage
    # Load from text file with frontmatter
    obj = TextObject.load(Path("content.txt"))

    # Load state from JSON with source content string
    config = LoadConfig(
        format=StorageFormat.JSON,
        source_content="Text content..."
    )
    obj = TextObject.load(Path("state.json"), config)

    # Load state from JSON with source content file
    config = LoadConfig(
        format=StorageFormat.JSON,
        source_content=Path("content.txt")
    )
    obj = TextObject.load(Path("state.json"), config)
Source code in src/tnh_scholar/ai_text_processing/text_object.py
merge_metadata(new_metadata, override=False)
Merge new metadata with existing metadata.
For now, performs simple dict-like union (|=) but can be extended to handle more complex merging logic in the future (e.g., merging nested structures, handling conflicts, merging arrays).
Args:
    new_metadata: Metadata to merge with existing metadata
    override: If True, new_metadata values override existing values.
        If False, existing values are preserved.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
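The merge semantics described above amount to a dict union. A minimal sketch of the `override` behavior, illustrative only and using plain dicts rather than the Metadata type:

```python
def merge_metadata(existing: dict, new: dict, override: bool = False) -> dict:
    """Dict-union merge: with override, new values win; otherwise existing values are kept."""
    return {**existing, **new} if override else {**new, **existing}

base = {"title": "Old Title", "author": "Thich Nhat Hanh"}
incoming = {"title": "New Title", "language": "vi"}
merged = merge_metadata(base, incoming)  # keeps "Old Title", gains "language"
```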
save(path, output_format=StorageFormat.TEXT, source_file=None, pretty=True)
Save TextObject to file in specified format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Output file path | *required* |
| `output_format` | `StorageFormat` | "text" for full content+metadata or "json" for serialized state | `StorageFormat.TEXT` |
| `source_file` | `Optional[Path]` | Optional source file to record in metadata | `None` |
| `pretty` | `bool` | For JSON output, whether to pretty print | `True` |
Source code in src/tnh_scholar/ai_text_processing/text_object.py
transform(data_str=None, language=None, metadata=None, process_metadata=None, sections=None)
Update TextObject content and metadata in place.
Optionally modifies the object's content, language, and adds process tracking. Process history is maintained in metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_str` | `Optional[str]` | New text content | `None` |
| `language` | `Optional[str]` | New language code | `None` |
| `process_metadata` | | Metadata identifying the process performed | `None` |
Source code in src/tnh_scholar/ai_text_processing/text_object.py
update_metadata(**kwargs)
Update metadata with new key-value pairs.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
validate_sections()
Basic validation of section integrity.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
TextObjectInfo
Bases: BaseModel
Serializable information about a text and its sections.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
language
instance-attribute
metadata
instance-attribute
sections
instance-attribute
source_file = None
class-attribute
instance-attribute
model_post_init(__context)
Ensure metadata is always a Metadata instance after initialization.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
__dir__()
Source code in src/tnh_scholar/ai_text_processing/__init__.py
__getattr__(name)
Source code in src/tnh_scholar/ai_text_processing/__init__.py
find_sections(text, source_language=None, section_pattern=None, section_model=None, max_tokens=DEFAULT_SECTION_RESULT_MAX_SIZE, section_count=None, review_count=DEFAULT_REVIEW_COUNT, template_dict=None)
High-level function for generating text sections.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `TextObject` | Input text | *required* |
| `source_language` | `Optional[str]` | ISO 639-1 language code | `None` |
| `section_pattern` | | Optional custom pattern (uses default if None) | `None` |
| `section_model` | | Optional model identifier | `None` |
| `max_tokens` | `int` | Maximum tokens for response | `DEFAULT_SECTION_RESULT_MAX_SIZE` |
| `section_count` | `Optional[int]` | Target number of sections | `None` |
| `review_count` | `int` | Number of review passes | `DEFAULT_REVIEW_COUNT` |
| `template_dict` | `Optional[Dict[str, str]]` | Optional additional template variables | `None` |

Returns:

| Type | Description |
|---|---|
| `TextObject` | TextObject containing section breakdown |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
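The `section_count` target can be pictured as dividing the numbered lines into contiguous ranges. The sketch below is a hypothetical illustration of that bookkeeping only; the actual sectioning is delegated to the AI model with `review_count` review passes.

```python
def segment_ranges(total_lines: int, section_count: int) -> list:
    """Split line numbers 1..total_lines into roughly equal contiguous (start, end) ranges."""
    base, extra = divmod(total_lines, section_count)
    ranges, start = [], 1
    for i in range(section_count):
        size = base + (1 if i < extra else 0)  # spread the remainder over early sections
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```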
get_pattern(name)
Get a pattern by name using the singleton PatternManager.
This is a more efficient version that reuses a single PatternManager instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the pattern to load | *required* |

Returns:

| Type | Description |
|---|---|
| `Prompt` | The loaded pattern |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If pattern name is invalid |
| `FileNotFoundError` | If pattern file doesn't exist |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
openai_process_text(text_input, process_instructions, model=None, response_format=None, batch=False, max_tokens=0)
Post-process a transcription using the OpenAI API.
Source code in src/tnh_scholar/ai_text_processing/openai_process_interface.py
process_text(text, pattern, source_language=None, model=None, template_dict=None)
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
process_text_by_paragraphs(text, template_dict, pattern=None, model=None)
High-level function for processing text paragraphs, yielding ProcessedSection objects. Assumes paragraphs are separated by newlines. Uses DEFAULT_XML_FORMAT_PATTERN as default pattern for text processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `TextObject` | TextObject to process | *required* |
| `template_dict` | `Dict[str, str]` | Dictionary for template substitution | *required* |
| `pattern` | `Optional[Prompt]` | Pattern object containing processing instructions | `None` |
| `model` | `Optional[str]` | Optional model identifier for processor | `None` |

Returns:

| Type | Description |
|---|---|
| `None` | Generator for ProcessedSection objects (one per paragraph) |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
process_text_by_sections(text_object, template_dict, pattern, model=None)
High-level function for processing text sections with configurable output handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text_object` | `TextObject` | Object containing section definitions | *required* |
| `template_dict` | `Dict` | Dictionary for template substitution | *required* |
| `pattern` | `Prompt` | Pattern object containing processing instructions | *required* |
| `model` | `Optional[str]` | Optional model identifier for processor | `None` |

Returns:

| Type | Description |
|---|---|
| `None` | Generator for ProcessedSections |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
translate_text_by_lines(text, source_language=None, target_language=DEFAULT_TARGET_LANGUAGE, pattern=None, model=None, style=None, segment_size=None, context_lines=None, review_count=None, template_dict=None)
Source code in src/tnh_scholar/ai_text_processing/line_translator.py
ai_text_processing
DEFAULT_MIN_SECTION_COUNT = 3
module-attribute
DEFAULT_OPENAI_MODEL = 'gpt-4o'
module-attribute
DEFAULT_PARAGRAPH_FORMAT_PATTERN = 'default_xml_paragraph_format'
module-attribute
DEFAULT_PUNCTUATE_MODEL = 'gpt-4o'
module-attribute
DEFAULT_PUNCTUATE_PATTERN = 'default_punctuate'
module-attribute
DEFAULT_PUNCTUATE_STYLE = 'APA'
module-attribute
DEFAULT_REVIEW_COUNT = 5
module-attribute
DEFAULT_SECTION_PATTERN = 'default_section'
module-attribute
DEFAULT_SECTION_RANGE_VAR = 2
module-attribute
DEFAULT_SECTION_RESULT_MAX_SIZE = 4000
module-attribute
DEFAULT_SECTION_TOKEN_SIZE = 650
module-attribute
DEFAULT_XML_FORMAT_PATTERN = 'default_xml_format'
module-attribute
SECTION_SEGMENT_SIZE_WARNING_LIMIT = 5
module-attribute
logger = get_child_logger(__name__)
module-attribute
GeneralProcessor
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
pattern = pattern
instance-attribute
processor = processor
instance-attribute
review_count = review_count
instance-attribute
source_language = source_language
instance-attribute
__init__(processor, pattern, source_language=None, review_count=DEFAULT_REVIEW_COUNT)
Initialize general processor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `processor` | | Implementation of TextProcessor | *required* |
| `pattern` | `Prompt` | Pattern object containing processing instructions | *required* |
| `source_language` | `Optional[str]` | Source language code | `None` |
| `review_count` | `int` | Number of review passes | `DEFAULT_REVIEW_COUNT` |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
process_text(text, template_dict=None)
Process a text based on a pattern and source language.
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
OpenAIProcessor
Bases: TextProcessor
OpenAI-based text processor implementation.
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
max_tokens = max_tokens
instance-attribute
model = model
instance-attribute
__init__(model=None, max_tokens=0)
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
process_text(input_str, instructions, response_format=None, max_tokens=0, **kwargs)
Process text using OpenAI API with optional structured output.
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
ProcessedSection
dataclass
Represents a processed section of text with its metadata.
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
metadata = field(default_factory=dict)
class-attribute
instance-attribute
original_str
instance-attribute
processed_str
instance-attribute
title
instance-attribute
__init__(title, original_str, processed_str, metadata=dict())
SectionParser
Generates structured section breakdowns of text content.
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
review_count = review_count
instance-attribute
section_pattern = section_pattern
instance-attribute
section_scanner = section_scanner
instance-attribute
__init__(section_scanner, section_pattern, review_count=DEFAULT_REVIEW_COUNT)
Initialize section generator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `section_scanner` | | Implementation of TextProcessor used to scan for sections | *required* |
| `section_pattern` | | Pattern object containing section generation instructions | *required* |
| `review_count` | `int` | Number of review passes | `DEFAULT_REVIEW_COUNT` |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
find_sections(text, section_count_target=None, segment_size_target=None, template_dict=None)
Generate a section breakdown of the input text. The text must be newline-separated.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `TextObject` | Input TextObject to process | *required* |
| `section_count_target` | `Optional[int]` | Target number of sections to find | `None` |
| `segment_size_target` | `Optional[int]` | Target number of lines per section (if section_count_target is specified, this value is derived to generate correct segments) | `None` |
| `template_dict` | `Optional[Dict[str, str]]` | Optional additional template variables | `None` |

Returns:

| Type | Description |
|---|---|
| `TextObject` | TextObject containing section breakdown |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
SectionProcessor
Handles section-based XML text processing with configurable output handling.
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
pattern = pattern
instance-attribute
processor = processor
instance-attribute
template_dict = template_dict
instance-attribute
wrap_in_document = wrap_in_document
instance-attribute
__init__(processor, pattern, template_dict, wrap_in_document=True)
Initialize the XML section processor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `processor` | `TextProcessor` | Implementation of TextProcessor to use | *required* |
| `pattern` | `Prompt` | Pattern object containing processing instructions | *required* |
| `template_dict` | `Dict` | Dictionary for template substitution | *required* |
| `wrap_in_document` | `bool` | Whether to wrap output in a document root element | `True` |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
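The `wrap_in_document` flag controls whether the joined section output is enclosed in a single root element. A sketch of the idea (note: the `<document>` tag name here is an assumption for illustration, not confirmed by the source):

```python
def join_sections(xml_fragments: list, wrap_in_document: bool = True) -> str:
    """Join processed XML section fragments, optionally inside a single root element."""
    body = "\n".join(xml_fragments)
    # NOTE: the actual root tag used by SectionProcessor is an assumption here.
    return f"<document>\n{body}\n</document>" if wrap_in_document else body
```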
process_paragraphs(text)
Process transcript by paragraphs (as sections), yielding ProcessedSection objects. Paragraphs are assumed to be newline-separated.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `TextObject` | TextObject to process | *required* |

Yields:

| Name | Type | Description |
|---|---|---|
| `ProcessedSection` | `ProcessedSection` | One processed paragraph at a time, containing: title (paragraph number, e.g. 'Paragraph 1'), original_str (raw paragraph text), processed_str (processed paragraph text), metadata (optional metadata dict) |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
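The paragraph splitting and titling convention described above can be sketched as follows (a hypothetical helper, not the package code):

```python
def iter_paragraphs(text: str):
    """Yield (title, paragraph) pairs for newline-separated paragraphs, skipping blanks."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        yield f"Paragraph {number}", paragraph
```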
process_sections(text_object)
Process transcript sections and yield results one section at a time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text_object` | `TextObject` | Object containing section definitions | *required* |

Yields:

| Name | Type | Description |
|---|---|---|
| `ProcessedSection` | `ProcessedSection` | One processed section at a time, containing: title (section title, English or original language), original_text (raw text segment), processed_text (processed text content), start_line (starting line number) |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
TextProcessor
Bases: ABC
Abstract base class for text processors that can return Pydantic objects.
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
process_text(input_str, instructions, response_format=None, **kwargs)
abstractmethod
Process text according to instructions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_str` | | Input text to process | *required* |
| `instructions` | `str` | Processing instructions | *required* |
| `response_format` | | Optional Pydantic class for structured output | `None` |
| `**kwargs` | | Additional processing parameters | `{}` |

Returns:

| Type | Description |
|---|---|
| `ProcessorResult` | Either string or Pydantic model instance based on response_format |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
find_sections(text, source_language=None, section_pattern=None, section_model=None, max_tokens=DEFAULT_SECTION_RESULT_MAX_SIZE, section_count=None, review_count=DEFAULT_REVIEW_COUNT, template_dict=None)
High-level function for generating text sections.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `TextObject` | Input text | *required* |
| `source_language` | `Optional[str]` | ISO 639-1 language code | `None` |
| `section_pattern` | | Optional custom pattern (uses default if None) | `None` |
| `section_model` | | Optional model identifier | `None` |
| `max_tokens` | `int` | Maximum tokens for response | `DEFAULT_SECTION_RESULT_MAX_SIZE` |
| `section_count` | `Optional[int]` | Target number of sections | `None` |
| `review_count` | `int` | Number of review passes | `DEFAULT_REVIEW_COUNT` |
| `template_dict` | `Optional[Dict[str, str]]` | Optional additional template variables | `None` |

Returns:

| Type | Description |
|---|---|
| `TextObject` | TextObject containing section breakdown |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
get_pattern(name)
Get a pattern by name using the singleton PatternManager.
This is a more efficient version that reuses a single PatternManager instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the pattern to load | *required* |

Returns:

| Type | Description |
|---|---|
| `Prompt` | The loaded pattern |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If pattern name is invalid |
| `FileNotFoundError` | If pattern file doesn't exist |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
process_text(text, pattern, source_language=None, model=None, template_dict=None)
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
process_text_by_paragraphs(text, template_dict, pattern=None, model=None)
High-level function for processing text paragraphs, yielding ProcessedSection objects. Assumes paragraphs are separated by newlines. Uses DEFAULT_XML_FORMAT_PATTERN as default pattern for text processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `TextObject` | TextObject to process | *required* |
| `template_dict` | `Dict[str, str]` | Dictionary for template substitution | *required* |
| `pattern` | `Optional[Prompt]` | Pattern object containing processing instructions | `None` |
| `model` | `Optional[str]` | Optional model identifier for processor | `None` |

Returns:

| Type | Description |
|---|---|
| `None` | Generator for ProcessedSection objects (one per paragraph) |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
process_text_by_sections(text_object, template_dict, pattern, model=None)
High-level function for processing text sections with configurable output handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text_object` | `TextObject` | Object containing section definitions | *required* |
| `template_dict` | `Dict` | Dictionary for template substitution | *required* |
| `pattern` | `Prompt` | Pattern object containing processing instructions | *required* |
| `model` | `Optional[str]` | Optional model identifier for processor | `None` |

Returns:

| Type | Description |
|---|---|
| `None` | Generator for ProcessedSections |
Source code in src/tnh_scholar/ai_text_processing/ai_text_processing.py
general_processor
line_translator
DEFAULT_TARGET_LANGUAGE = 'en'
module-attribute
DEFAULT_TRANSLATE_CONTEXT_LINES = 3
module-attribute
DEFAULT_TRANSLATE_STYLE = "'American Dharma Teaching'"
module-attribute
DEFAULT_TRANSLATION_PATTERN = 'default_line_translate'
module-attribute
DEFAULT_TRANSLATION_TARGET_TOKENS = 300
module-attribute
FOLLOWING_CONTEXT_MARKER = 'FOLLOWING_CONTEXT'
module-attribute
MAX_RETRIES = 6
module-attribute
MIN_SEGMENT_SIZE = 4
module-attribute
PRECEDING_CONTEXT_MARKER = 'PRECEDING_CONTEXT'
module-attribute
TRANSCRIPT_SEGMENT_MARKER = 'TRANSCRIPT_SEGMENT'
module-attribute
logger = get_child_logger(__name__)
module-attribute
LineTranslator
Translates text line by line while maintaining line numbers and context.
Source code in src/tnh_scholar/ai_text_processing/line_translator.py
context_lines = context_lines
instance-attribute
pattern = pattern
instance-attribute
processor = processor
instance-attribute
review_count = review_count
instance-attribute
style = style
instance-attribute
__init__(processor, pattern, review_count=DEFAULT_REVIEW_COUNT, style=DEFAULT_TRANSLATE_STYLE, context_lines=DEFAULT_TRANSLATE_CONTEXT_LINES)
Initialize line translator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `processor` | `TextProcessor` | Implementation of TextProcessor | *required* |
| `pattern` | `Prompt` | Pattern object containing translation instructions | *required* |
| `review_count` | `int` | Number of review passes | `DEFAULT_REVIEW_COUNT` |
| `style` | `str` | Translation style to apply | `DEFAULT_TRANSLATE_STYLE` |
| `context_lines` | `int` | Number of context lines to include before/after | `DEFAULT_TRANSLATE_CONTEXT_LINES` |
Source code in src/tnh_scholar/ai_text_processing/line_translator.py
translate_segment(num_text, start_line, end_line, metadata, target_language, source_language, template_dict=None)
Translate a segment of text with context.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `num_text` | | Full numbered text to extract the segment from | *required* |
| `start_line` | `int` | Starting line number of segment | *required* |
| `end_line` | `int` | Ending line number of segment | *required* |
| `metadata` | `Metadata` | Metadata for text | *required* |
| `target_language` | `str` | Target language code | *required* |
| `source_language` | `str` | Source language code | *required* |
| `template_dict` | `Optional[Dict]` | Optional additional template values | `None` |

Returns:

| Type | Description |
|---|---|
| `str` | Translated text segment with line numbers preserved |
Source code in src/tnh_scholar/ai_text_processing/line_translator.py
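The context handling above can be sketched as plain list slicing over 1-indexed line numbers. This is illustrative only; the real method also renders the PRECEDING_CONTEXT / TRANSCRIPT_SEGMENT / FOLLOWING_CONTEXT markers into the prompt:

```python
def build_segment_context(lines: list, start_line: int, end_line: int, context_lines: int = 3):
    """Return (preceding, segment, following) line lists; start/end are 1-indexed, inclusive."""
    preceding = lines[max(0, start_line - 1 - context_lines):start_line - 1]
    segment = lines[start_line - 1:end_line]
    following = lines[end_line:end_line + context_lines]
    return preceding, segment, following
```

Context windows are clamped at the start and end of the text, so the first and last segments simply receive fewer context lines.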
translate_text(text, source_language, segment_size=None, target_language=DEFAULT_TARGET_LANGUAGE, template_dict=None)
Translate entire text in segments while maintaining line continuity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `TextObject` | Text to translate | *required* |
| `source_language` | `str` | Source language code | *required* |
| `segment_size` | `Optional[int]` | Number of lines per translation segment | `None` |
| `target_language` | `str` | Target language code (default: en for English) | `DEFAULT_TARGET_LANGUAGE` |
| `template_dict` | `Optional[Dict]` | Optional additional template values | `None` |

Returns:

| Type | Description |
|---|---|
| `TextObject` | Complete translated text with line numbers preserved |
Source code in src/tnh_scholar/ai_text_processing/line_translator.py
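Translating an entire text "in segments while maintaining line continuity" reduces to walking fixed-size inclusive line ranges; a hypothetical sketch of that iteration:

```python
def line_segments(total_lines: int, segment_size: int):
    """Yield (start, end) inclusive 1-indexed line ranges covering the whole text in order."""
    start = 1
    while start <= total_lines:
        end = min(start + segment_size - 1, total_lines)
        yield start, end
        start = end + 1
```

Each yielded range would then be passed to `translate_segment`, so no line is translated twice and none is skipped.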
translate_text_by_lines(text, source_language=None, target_language=DEFAULT_TARGET_LANGUAGE, pattern=None, model=None, style=None, segment_size=None, context_lines=None, review_count=None, template_dict=None)
Source code in src/tnh_scholar/ai_text_processing/line_translator.py
openai_process_interface
TOKEN_BUFFER = 500
module-attribute
logger = get_child_logger(__name__)
module-attribute
openai_process_text(text_input, process_instructions, model=None, response_format=None, batch=False, max_tokens=0)
Post-process a transcription using the OpenAI API.
Source code in src/tnh_scholar/ai_text_processing/openai_process_interface.py
prompts
MANAGER_UPDATE_MESSAGE = 'PromptManager Update:'
module-attribute
MarkdownStr = NewType('MarkdownStr', str)
module-attribute
logger = get_child_logger(__name__)
module-attribute
ConcurrentAccessManager
Manages concurrent access to prompt files.
Provides:

- File-level locking
- Safe concurrent access to prompts
- Lock cleanup
Source code in src/tnh_scholar/ai_text_processing/prompts.py
lock_dir = Path(lock_dir)
instance-attribute
__init__(lock_dir)
Initialize access manager.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `lock_dir` | `Path` | Directory for lock files | *required* |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
file_lock(file_path)
Context manager for safely accessing files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to file to lock | *required* |
Yields:
| Type | Description |
|---|---|
| | `None` when lock is acquired |
Raises:
| Type | Description |
|---|---|
| `RuntimeError` | If file is already locked |
| `OSError` | If lock file operations fail |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
is_locked(file_path)
Check if a file is currently locked.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to file to check | *required* |
Returns:
| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | True if file is locked |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
GitBackedRepository
Manages versioned storage of prompts using Git.
Provides basic Git operations while hiding complexity:

- Automatic versioning of changes
- Basic conflict resolution
- History tracking
Source code in src/tnh_scholar/ai_text_processing/prompts.py
repo = Repo(repo_path)
instance-attribute
repo_path = repo_path
instance-attribute
__init__(repo_path)
Initialize or connect to Git repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `repo_path` | `Path` | Path to repository directory | *required* |
Raises:
| Type | Description |
|---|---|
| `GitCommandError` | If Git operations fail |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
display_history(file_path, max_versions=0)
Display history of changes for a file with diffs between versions.
Shows most recent changes first, limited to max_versions entries. For each change shows:

- Commit info and date
- Stats summary of changes
- Detailed color diff with 2 lines of context
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to file in repository | *required* |
| `max_versions` | `int` | Maximum number of versions to show | `0` |
Example:

```
>>> repo.display_history(Path("prompts/format_dharma_talk.yaml"))
Commit abc123def (2024-12-28 14:30:22):
  1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/prompts/format_dharma_talk.yaml ...
...
```
Source code in src/tnh_scholar/ai_text_processing/prompts.py
update_file(file_path)
Stage and commit changes to a file in the Git repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Absolute or relative path to the file. | *required* |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | Commit hash if changes were made. |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the file does not exist. |
| `ValueError` | If the file is outside the repository. |
| `GitCommandError` | If Git operations fail. |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
LocalPromptManager
A simple singleton implementation of PromptManager that ensures only one instance is created and reused throughout the application lifecycle.
This class wraps the PromptManager to provide efficient prompt loading by maintaining a single reusable instance.
Attributes:
| Name | Type | Description |
|---|---|---|
| `_instance` | `Optional[SingletonPromptManager]` | The singleton instance |
| `_prompt_manager` | `Optional[PromptManager]` | The wrapped PromptManager instance |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
prompt_manager
property
Lazy initialization of the PromptManager instance.
Returns:
| Name | Type | Description |
|---|---|---|
| `PromptManager` | `PromptCatalog` | The wrapped PromptManager instance |
Raises:
| Type | Description |
|---|---|
| `RuntimeError` | If PATTERN_REPO is not properly configured |
__new__()
Create or return the singleton instance.
Returns:
| Name | Type | Description |
|---|---|---|
| `SingletonPromptManager` | `LocalPromptManager` | The singleton instance |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
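The singleton pattern described here can be sketched as follows (`LocalManager` is an illustrative stand-in, not the package class):

```python
class LocalManager:
    """Sketch of the singleton pattern LocalPromptManager uses.

    __new__ returns the one shared instance; setup runs only once.
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialized = False
        return cls._instance

    def __init__(self):
        if not self._initialized:
            self.load_count = 0  # placeholder for costly PromptManager setup
            self._initialized = True
```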
get_prompt(name)
Get a prompt by name.
Source code in src/tnh_scholar/ai_text_processing/prompts.py
Prompt
Base Prompt class for version-controlled template prompts.
Prompts contain:

- Instructions: The main prompt instructions as a Jinja2 template. Note: Instructions are intended to be saved in markdown format in a .md file.
- Template fields: Default values for template variables
- Metadata: Name and identifier information
Version control is handled externally through Git, not in the prompt itself. Prompt identity is determined by the combination of identifiers.
Attributes:
| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the prompt |
| `instructions` | `str` | The Jinja2 template string for this prompt |
| `default_template_fields` | `Dict[str, str]` | Default values for template variables |
| `_allow_empty_vars` | `bool` | Whether to allow undefined template variables |
| `_env` | `Environment` | Configured Jinja2 environment instance |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
default_template_fields = default_template_fields or {}
instance-attribute
instructions = instructions
instance-attribute
name = name
instance-attribute
path = path
instance-attribute
__eq__(other)
Compare prompts based on their content.
Source code in src/tnh_scholar/ai_text_processing/prompts.py
__hash__()
Hash based on content hash for container operations.
Source code in src/tnh_scholar/ai_text_processing/prompts.py
__init__(name, instructions, path=None, default_template_fields=None, allow_empty_vars=False)
Initialize a new Prompt instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Unique name identifying the prompt | *required* |
| `instructions` | `MarkdownStr` | Jinja2 template string containing the prompt | *required* |
| `default_template_fields` | `Optional[Dict[str, str]]` | Optional default values for template variables | `None` |
| `allow_empty_vars` | `bool` | Whether to allow undefined template variables | `False` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If name or instructions are empty |
| `TemplateError` | If template syntax is invalid |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
apply_template(field_values=None)
Apply template values to prompt instructions using Jinja2.
Values precedence (highest to lowest):

1. field_values (explicitly passed)
2. frontmatter values (from prompt file)
3. default_template_fields (prompt defaults)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `field_values` | `Optional[Dict[str, str]]` | Values to substitute into the template. If None, uses frontmatter/defaults. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | Rendered instructions with template values applied. |
Raises:
| Type | Description |
|---|---|
| `TemplateError` | If template rendering fails |
| `ValueError` | If required template variables are missing |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
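The documented precedence can be illustrated with `collections.ChainMap`, where earlier maps win (`resolve_fields` is a hypothetical helper, not the package API):

```python
from collections import ChainMap


def resolve_fields(field_values, frontmatter, defaults):
    """Merge template values with the documented precedence:
    explicit field_values > frontmatter > default_template_fields."""
    return dict(ChainMap(field_values or {}, frontmatter or {}, defaults or {}))


resolved = resolve_fields(
    field_values={"speaker": "Sister Chan Khong"},
    frontmatter={"speaker": "unknown", "language": "vi"},
    defaults={"language": "en", "style": "plain"},
)
# field_values wins for "speaker", frontmatter for "language",
# defaults fill in "style"
```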
content_hash()
Generate a SHA-256 hash of the prompt content.
Useful for quick content comparison and change detection.
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | Hexadecimal string of the SHA-256 hash |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
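The hashing approach can be sketched with `hashlib`; the exact fields the real method folds into the hash are an assumption here, the point being that identical content yields an identical fixed-length digest:

```python
import hashlib


def content_hash(name: str, instructions: str) -> str:
    """Sketch of SHA-256 content hashing for change detection."""
    payload = f"{name}\n{instructions}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


h1 = content_hash("format_talk", "Format the talk below.")
h2 = content_hash("format_talk", "Format the talk below.")
h3 = content_hash("format_talk", "Format the talk above.")
# h1 == h2 (same content), h1 != h3 (changed content)
```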
extract_frontmatter()
Extract and validate YAML frontmatter from markdown instructions.
Returns:
| Type | Description |
|---|---|
| `Optional[Dict[str, Any]]` | Frontmatter data if found and valid, None otherwise |
Note
Frontmatter must be at the very start of the file and properly formatted.
Source code in src/tnh_scholar/ai_text_processing/prompts.py
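A minimal sketch of the start-of-file frontmatter rule, parsing only simple `key: value` pairs (the real method validates full YAML):

```python
def extract_frontmatter(markdown: str):
    """Return frontmatter as a dict, or None if absent or malformed.

    Frontmatter must be a ----delimited block at the very start of
    the document, matching the Note above.
    """
    if not markdown.startswith("---\n"):
        return None  # frontmatter must open the file
    end = markdown.find("\n---", 4)
    if end == -1:
        return None  # unterminated block
    data = {}
    for line in markdown[4:end].splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            data[key.strip()] = value.strip()
    return data or None


doc = "---\ntitle: Peace Is Every Step\nlanguage: en\n---\n# Body\n"
```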
from_dict(data)
classmethod
Create prompt instance from dictionary data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `Dict[str, Any]` | Dictionary containing prompt data | *required* |
Returns:
| Name | Type | Description |
|---|---|---|
| `Prompt` | `Prompt` | New prompt instance |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If required fields are missing |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
get_content_without_frontmatter()
Get markdown content with frontmatter removed.
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | Markdown content without frontmatter |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
source_bytes()
Best-effort raw bytes for prompt hashing.
Prefers hashing exact on-disk bytes including front-matter.
We therefore first try to read from prompt_path. If that fails, we fall back
to hashing the concatenation of known templates. In V1, only
the instructions (system template) are used for rendering.
Source code in src/tnh_scholar/ai_text_processing/prompts.py
to_dict()
Convert prompt to dictionary for serialization.
Returns:
| Type | Description |
|---|---|
| `Dict[str, Any]` | Dict containing all prompt data in serializable format |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
update_frontmatter(new_data)
Update or add frontmatter to the markdown content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `new_data` | `Dict[str, Any]` | Dictionary of frontmatter fields to update | *required* |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
PromptCatalog
Main interface for prompt management system.
Provides high-level operations:

- Prompt creation and loading
- Automatic versioning
- Safe concurrent access
- Basic history tracking
- Case-insensitive prompt names (stored as lowercase)
Source code in src/tnh_scholar/ai_text_processing/prompts.py
access_manager = ConcurrentAccessManager(self.base_path / '.locks')
instance-attribute
base_path = Path(base_path).resolve()
instance-attribute
repo = GitBackedRepository(self.base_path)
instance-attribute
__init__(base_path)
Initialize prompt management system.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `base_path` | `Path` | Base directory for prompt storage | *required* |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
get_path(prompt_name)
Recursively search for a prompt file with the given name (case-insensitive) in base_path and all subdirectories.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prompt_name` | `str` | Prompt name (without extension) to search for | *required* |
Returns:
| Type | Description |
|---|---|
| `Optional[Path]` | Full path to the found prompt file, or None if not found |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
load(prompt_name)
Load the .md prompt file by name, extract placeholders, and return a fully constructed Prompt object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prompt_name` | `str` | Name of the prompt (without .md extension). | *required* |
Returns:
| Type | Description |
|---|---|
| `Prompt` | A new Prompt object whose `instructions` is the file's text and whose `template_fields` are inferred from placeholders in those instructions. |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
save(prompt, subdir=None)
Source code in src/tnh_scholar/ai_text_processing/prompts.py
show_history(prompt_name)
Source code in src/tnh_scholar/ai_text_processing/prompts.py
verify_repository(base_path)
classmethod
Verify repository integrity and uniqueness of prompt names.
Performs the following checks:

1. Validates Git repository structure.
2. Ensures no duplicate prompt names exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `base_path` | `Path` | Repository path to verify. | *required* |
Returns:
| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | True if the repository is valid and contains no duplicate prompt files. |
Source code in src/tnh_scholar/ai_text_processing/prompts.py
response_format
TEXT_SECTIONS_DESCRIPTION = 'Ordered list of logical sections for the text. The sequence of line ranges for the sections must cover every line from start to finish without any overlaps or gaps.'
module-attribute
LogicalSection
Bases: BaseModel
A logically coherent section of text.
Source code in src/tnh_scholar/ai_text_processing/response_format.py
end_line = Field(..., description='Ending line number of the section (inclusive).')
class-attribute
instance-attribute
start_line = Field(..., description='Starting line number of the section (inclusive).')
class-attribute
instance-attribute
title = Field(..., description='Meaningful title for the section in the original language of the section.')
class-attribute
instance-attribute
TextObject
Bases: BaseModel
Represents a text in any language broken into coherent logical sections.
Source code in src/tnh_scholar/ai_text_processing/response_format.py
language = Field(..., description='ISO 639-1 language code of the text.')
class-attribute
instance-attribute
sections = Field(..., description=TEXT_SECTIONS_DESCRIPTION)
class-attribute
instance-attribute
section_processor
text_object
StorageFormatType = Union[StorageFormat, Literal['text', 'json']]
module-attribute
logger = get_child_logger(__name__)
module-attribute
AIResponse
Bases: BaseModel
Class for dividing large texts into AI-processable segments while maintaining broader document context.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
document_metadata = Field(..., description='Available Dublin Core standard metadata in human-readable YAML format')
class-attribute
instance-attribute
document_summary = Field(..., description="Concise, comprehensive overview of the text's content and purpose")
class-attribute
instance-attribute
key_concepts = Field(..., description='Important terms, ideas, or references that appear throughout the text')
class-attribute
instance-attribute
language = Field(..., description='ISO 639-1 language code')
class-attribute
instance-attribute
narrative_context = Field(..., description='Concise overview of how the text develops or progresses as a whole')
class-attribute
instance-attribute
sections
instance-attribute
LoadConfig
dataclass
Configuration for loading a TextObject.
Attributes:
| Name | Type | Description |
|---|---|---|
| `format` | `StorageFormat` | Storage format of the input file |
| `source_str` | `Optional[str]` | Optional source content as string |
| `source_file` | `Optional[Path]` | Optional path to source content file |
Note
For JSON format, exactly one of source_str or source_file may be provided. Both fields are ignored for TEXT format.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
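The mutual-exclusion rule in the Note can be sketched with a dataclass `__post_init__` check (`LoadSpec` and its fields are an illustrative stand-in, not the real class):

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class LoadSpec:
    """Sketch of LoadConfig's validation: for JSON input, source_str
    and source_file must not both be provided."""
    format: str = "text"
    source_str: Optional[str] = None
    source_file: Optional[Path] = None

    def __post_init__(self):
        # Both fields are ignored for text format, so only JSON is checked.
        if self.format == "json" and self.source_str and self.source_file:
            raise ValueError("Provide source_str or source_file, not both")
```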
format = StorageFormat.TEXT
class-attribute
instance-attribute
source_file = None
class-attribute
instance-attribute
source_str = None
class-attribute
instance-attribute
__init__(format=StorageFormat.TEXT, source_str=None, source_file=None)
__post_init__()
Validate configuration.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
get_source_text()
Get source content as text if provided.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
LogicalSection
Bases: BaseModel
Represents a contextually meaningful segment of a larger text.
Sections should preserve natural breaks in content (explicit section markers, topic shifts, argument development, narrative progression) while staying within specified size limits in order to create chunks suitable for AI processing.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
start_line = Field(..., description='Starting line number that begins this logical segment')
class-attribute
instance-attribute
title = Field(..., description="Descriptive title of section's key content")
class-attribute
instance-attribute
SectionEntry
Bases: NamedTuple
Represents a section with its content during iteration.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
content
instance-attribute
number
instance-attribute
range
instance-attribute
title
instance-attribute
SectionObject
dataclass
Represents a section of text with metadata.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
metadata
instance-attribute
section_range
instance-attribute
title
instance-attribute
__init__(title, section_range, metadata)
from_logical_section(logical_section, end_line, metadata=None)
classmethod
Create a SectionObject from a LogicalSection model.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
SectionRange
Bases: NamedTuple
Represents the line range of a section.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
end
instance-attribute
start
instance-attribute
StorageFormat
Bases: Enum
Source code in src/tnh_scholar/ai_text_processing/text_object.py
JSON = 'json'
class-attribute
instance-attribute
TEXT = 'text'
class-attribute
instance-attribute
TextObject
Manages text content with section organization and metadata tracking.
TextObject serves as the core container for text processing, providing:

- Line-numbered text content management
- Language identification
- Section organization and access
- Metadata tracking including incorporated processing stages
Section boundaries are defined through line numbering: each section is specified by a start line without an explicit end line, and implicitly ends where the next section begins. Sections are represented by SectionObjects.
Attributes:
| Name | Type | Description |
|---|---|---|
| `num_text` | `NumberedText` | Line-numbered text content manager |
| `language` | `str` | ISO 639-1 language code for the text content |
| `_sections` | `List[SectionObject]` | Internal list of text sections with boundaries |
| `_metadata` | `Metadata` | Processing and content metadata container |
Example:

```python
content = NumberedText("Line 1\nLine 2\nLine 3")
obj = TextObject(content, language="en")
```
Source code in src/tnh_scholar/ai_text_processing/text_object.py
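The implicit end-line rule (each section ends where the next begins) can be sketched as a small helper (`section_ranges` is a hypothetical illustration, not the package API):

```python
def section_ranges(start_lines, last_line):
    """Compute (start, end) line ranges from section start lines only.

    Each section ends one line before the next section starts; the
    final section runs to the last line of the text.
    """
    ranges = []
    for i, start in enumerate(start_lines):
        end = start_lines[i + 1] - 1 if i + 1 < len(start_lines) else last_line
        ranges.append((start, end))
    return ranges
```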
content
property
language = language or get_language_code_from_text(num_text.content)
instance-attribute
last_line_num
property
metadata
property
Access to metadata dictionary.
metadata_str
property
num_text = num_text
instance-attribute
numbered_content
property
section_count
property
sections
property
Access to sections list.
__init__(num_text, language=None, sections=None, metadata=None)
Initialize a TextObject with content and optional organizing components.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `num_text` | `NumberedText` | Text content with line numbering | *required* |
| `language` | `Optional[str]` | ISO 639-1 language code. If None, auto-detected from content | `None` |
| `sections` | `Optional[List[SectionObject]]` | Initial sections defining text organization. If None, text is considered un-sectioned. | `None` |
| `metadata` | `Optional[Metadata]` | Initial metadata. If None, creates empty metadata container | `None` |
Note
Until sections are established, section-based methods will raise a ValueError if called.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
__iter__()
Iterate through sections, yielding full section information.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
__str__()
Source code in src/tnh_scholar/ai_text_processing/text_object.py
export_info(source_file=None)
Export serializable state.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
from_info(info, metadata, num_text)
classmethod
Create TextObject from info and content.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
from_response(response, existing_metadata, num_text)
classmethod
Create TextObject from AI response format.
Source code in src/tnh_scholar/ai_text_processing/text_object.py
from_section_file(section_file, source=None)
classmethod
Create TextObject from a section info file, loading content from source_file. Metadata is extracted from the source_file or from content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `section_file` | `Path` | Path to JSON file containing TextObjectInfo | *required* |
| `source` | `Optional[str]` | Optional source string in case no source file is found. | `None` |
Returns:
| Type | Description |
|---|---|
| `TextObject` | TextObject instance |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If source_file is missing from section info |
| `FileNotFoundError` | If either section_file or source_file not found |
Source code in src/tnh_scholar/ai_text_processing/text_object.py
from_str(text, language=None, sections=None, metadata=None)
classmethod
Create a TextObject from a string, extracting any frontmatter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text string, potentially containing frontmatter | *required* |
| `language` | `Optional[str]` | ISO language code | `None` |
| `sections` | `Optional[List[SectionObject]]` | List of section objects | `None` |
| `metadata` | `Optional[Metadata]` | Optional base metadata to merge with frontmatter | `None` |
Returns:
| Type | Description |
|---|---|
| `TextObject` | TextObject instance with combined metadata |
Source code in src/tnh_scholar/ai_text_processing/text_object.py
from_text_file(file)
classmethod
Source code in src/tnh_scholar/ai_text_processing/text_object.py
get_section_content(index)
Source code in src/tnh_scholar/ai_text_processing/text_object.py
load(path, config=None)
classmethod
Load TextObject from file with optional configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Input file path | *required* |
| `config` | `Optional[LoadConfig]` | Optional loading configuration. If not provided, loads directly from text file. | `None` |
|
Returns:
| Type | Description |
|---|---|
| `TextObject` | TextObject instance |
Usage:

```python
# Load from text file with frontmatter
obj = TextObject.load(Path("content.txt"))

# Load state from JSON with source content string
config = LoadConfig(format=StorageFormat.JSON, source_str="Text content...")
obj = TextObject.load(Path("state.json"), config)

# Load state from JSON with source content file
config = LoadConfig(format=StorageFormat.JSON, source_file=Path("content.txt"))
obj = TextObject.load(Path("state.json"), config)
```
Source code in src/tnh_scholar/ai_text_processing/text_object.py
505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 | |
merge_metadata(new_metadata, override=False)
Merge new metadata with existing metadata.
For now, performs simple dict-like union (|=) but can be extended to handle more complex merging logic in the future (e.g., merging nested structures, handling conflicts, merging arrays).
Args:
new_metadata: Metadata to merge with existing metadata.
override: If True, new_metadata values override existing values; if False, existing values are preserved.

Source code in src/tnh_scholar/ai_text_processing/text_object.py (lines 324–352)
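The dict-union semantics described above can be shown with a standalone sketch. This mirrors only the override flag; the actual method operates on the object's Metadata model, and the helper below is illustrative.

```python
def merge_metadata(existing, new_metadata, override=False):
    """Union two metadata dicts; 'override' decides which side wins on conflict."""
    if override:
        return {**existing, **new_metadata}   # new values win
    return {**new_metadata, **existing}       # existing values preserved

base = {"title": "Old", "author": "Thay"}
merged = merge_metadata(base, {"title": "New", "year": 1998})
```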
save(path, output_format=StorageFormat.TEXT, source_file=None, pretty=True)
Save TextObject to file in specified format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Output file path | required |
| output_format | StorageFormat | "text" for full content+metadata or "json" for serialized state | StorageFormat.TEXT |
| source_file | Optional[Path] | Optional source file to record in metadata | None |
| pretty | bool | For JSON output, whether to pretty print | True |

Source code in src/tnh_scholar/ai_text_processing/text_object.py (lines 476–503)
transform(data_str=None, language=None, metadata=None, process_metadata=None, sections=None)
Update TextObject content and metadata in place.
Optionally modifies the object's content, language, and adds process tracking. Process history is maintained in metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data_str | Optional[str] | New text content | None |
| language | Optional[str] | New language code | None |
| process_metadata | Optional[Metadata] | Metadata identifying the process performed | None |

Source code in src/tnh_scholar/ai_text_processing/text_object.py (lines 552–582)
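The in-place update with process tracking that transform describes can be sketched as follows. This is a hedged stand-in: the dict-based state and the "processes" history key are illustrative, not the library's field names.

```python
from typing import Optional

def transform(state, new_content, process_tag, language: Optional[str] = None):
    """Replace content, optionally update language, and append a history entry."""
    state["content"] = new_content
    if language is not None:
        state["language"] = language
    # Process history is maintained in metadata, per the docstring above.
    state.setdefault("metadata", {}).setdefault("processes", []).append(process_tag)

doc = {"content": "xin chao", "language": "vi", "metadata": {}}
transform(doc, "hello", process_tag="translate_en", language="en")
```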
update_metadata(**kwargs)
Update metadata with new key-value pairs.
Source code in src/tnh_scholar/ai_text_processing/text_object.py (lines 354–357)
validate_sections()
Basic validation of section integrity.
Source code in src/tnh_scholar/ai_text_processing/text_object.py (lines 359–372)
TextObjectInfo
Bases: BaseModel
Serializable information about a text and its sections.
Source code in src/tnh_scholar/ai_text_processing/text_object.py (lines 141–153)
language
instance-attribute
metadata
instance-attribute
sections
instance-attribute
source_file = None
class-attribute
instance-attribute
model_post_init(__context)
Ensure metadata is always a Metadata instance after initialization.
Source code in src/tnh_scholar/ai_text_processing/text_object.py (lines 148–153)
typing
ProcessorResult = Union[str, ResponseFormat]
module-attribute
ResponseFormat = TypeVar('ResponseFormat', bound=BaseModel)
module-attribute
audio_processing
__all__ = ['DiarizationConfig', 'detect_nonsilent', 'detect_whisper_boundaries', 'split_audio', 'split_audio_at_boundaries']
module-attribute
DiarizationConfig
Bases: BaseSettings
Source code in src/tnh_scholar/audio_processing/diarization/config.py (lines 148–159)
chunk = ChunkConfig()
class-attribute
instance-attribute
language = LanguageConfig()
class-attribute
instance-attribute
mapping = MappingPolicy()
class-attribute
instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='DIARIZATION_', extra='ignore')
class-attribute
instance-attribute
speaker = SpeakerConfig()
class-attribute
instance-attribute
detect_whisper_boundaries(audio_file, model_size='tiny', language=None)
Detect sentence boundaries using a Whisper model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Path | Path to the audio file. | required |
| model_size | str | Whisper model size. | 'tiny' |
| language | str | Language to force for transcription (e.g. 'en', 'vi'), or None for auto. | None |

Returns:

| Type | Description |
|---|---|
| List[Boundary] | A list of sentence boundaries with text. |

Example

boundaries = detect_whisper_boundaries(Path("my_audio.mp3"), model_size="tiny")
for b in boundaries:
    print(b.start, b.end, b.text)

Source code in src/tnh_scholar/audio_processing/audio_legacy.py (lines 47–93)
split_audio(audio_file, method='whisper', output_dir=None, model_size='tiny', language=None, min_silence_len=MIN_SILENCE_LENGTH, silence_thresh=SILENCE_DBFS_THRESHOLD, max_duration=MAX_DURATION)
High-level function to split an audio file into chunks based on a chosen method.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Path | The input audio file. | required |
| method | str | Splitting method, "silence" or "whisper". | 'whisper' |
| output_dir | Path | Directory to store output. | None |
| model_size | str | Whisper model size if method='whisper'. | 'tiny' |
| language | str | Language for whisper transcription if method='whisper'. | None |
| min_silence_len | int | For silence-based detection, min silence length in ms. | MIN_SILENCE_LENGTH |
| silence_thresh | int | Silence threshold in dBFS. | SILENCE_DBFS_THRESHOLD |
| max_duration | int | Max chunk length in seconds (converted to ms for silence-based detection). | MAX_DURATION |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | Path | Directory containing the resulting chunks. |

Example

Split using silence detection:
split_audio(Path("my_audio.mp3"), method="silence")

Split using whisper-based sentence boundaries:
split_audio(Path("my_audio.mp3"), method="whisper", model_size="base", language="en")

Source code in src/tnh_scholar/audio_processing/audio_legacy.py (lines 238–297)
split_audio_at_boundaries(audio_file, boundaries, output_dir=None, max_duration=MAX_DURATION)
Split the audio file into chunks based on provided boundaries, ensuring all audio is included and boundaries align with the start of Whisper segments.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Path | The input audio file. | required |
| boundaries | List[Boundary] | Detected boundaries. | required |
| output_dir | Path | Directory to store the resulting chunks. | None |
| max_duration | int | Maximum chunk length in seconds. | MAX_DURATION |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | Path | Directory containing the chunked audio files. |

Example

boundaries = [Boundary(34.02, 37.26, "..."), Boundary(38.0, 41.18, "...")]
out_dir = split_audio_at_boundaries(Path("my_audio.mp3"), boundaries)

Source code in src/tnh_scholar/audio_processing/audio_legacy.py (lines 153–235)
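The grouping policy implied above, packing consecutive boundaries into chunks whose total span stays under max_duration, can be sketched without any audio dependencies. The Boundary fields follow the dataclass documented in this module; the grouping function itself is an illustrative reimplementation, not the library's code.

```python
from dataclasses import dataclass

@dataclass
class Boundary:
    start: float
    end: float
    text: str = ""

def group_boundaries(boundaries, max_duration=600):
    """Group boundaries into (start, end) chunk spans capped at max_duration seconds."""
    chunks = []
    chunk_start = boundaries[0].start
    for prev, nxt in zip(boundaries, boundaries[1:]):
        # Start a new chunk when adding the next boundary would exceed the cap.
        if nxt.end - chunk_start > max_duration:
            chunks.append((chunk_start, prev.end))
            chunk_start = nxt.start
    chunks.append((chunk_start, boundaries[-1].end))
    return chunks

spans = group_boundaries(
    [Boundary(0, 200), Boundary(200, 500), Boundary(500, 900)], max_duration=600
)
```

Because cuts only ever land on boundary starts, every chunk begins at the start of a Whisper segment, which is the alignment property the docstring promises.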
audio_legacy
EXPECTED_TIME_FACTOR = 0.45
module-attribute
MAX_DURATION = 10 * 60
module-attribute
MAX_DURATION_MS = 10 * 60 * 1000
module-attribute
MAX_INT16 = 32768.0
module-attribute
MIN_SILENCE_LENGTH = 1000
module-attribute
SEEK_LENGTH = 50
module-attribute
SILENCE_DBFS_THRESHOLD = -30
module-attribute
logger = get_child_logger('audio_processing')
module-attribute
Boundary
dataclass
A data structure representing a detected audio boundary.
Attributes:

| Name | Type | Description |
|---|---|---|
| start | float | Start time of the segment in seconds. |
| end | float | End time of the segment in seconds. |
| text | str | Associated text (empty if silence-based). |

Example

b = Boundary(start=0.0, end=30.0, text="Hello world")
b.start, b.end, b.text
(0.0, 30.0, 'Hello world')

Source code in src/tnh_scholar/audio_processing/audio_legacy.py (lines 27–44)
end
instance-attribute
start
instance-attribute
text = ''
class-attribute
instance-attribute
__init__(start, end, text='')
audio_to_numpy(audio_segment)
Convert an AudioSegment object to a NumPy array suitable for Whisper.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_segment | AudioSegment | The input audio segment to convert. | required |

Returns:

| Type | Description |
|---|---|
| np.ndarray | A mono-channel NumPy array normalized to the range [-1, 1]. |

Example

audio = AudioSegment.from_file("example.mp3")
audio_numpy = audio_to_numpy(audio)

Source code in src/tnh_scholar/audio_processing/audio_legacy.py (lines 366–390)
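The normalization step at the heart of audio_to_numpy divides raw 16-bit PCM samples by MAX_INT16 = 32768.0 (defined in this module). A minimal sketch, using plain lists in place of the pydub AudioSegment and NumPy array so it stays dependency-free:

```python
MAX_INT16 = 32768.0  # matches the module constant documented above

def normalize_samples(samples):
    """Scale raw 16-bit PCM samples into the [-1, 1] range Whisper expects."""
    return [s / MAX_INT16 for s in samples]

normalized = normalize_samples([0, 16384, -32768, 32767])
```

Note the asymmetry of int16: -32768 maps exactly to -1.0, while the maximum positive sample 32767 maps to just under 1.0.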
detect_silence_boundaries(audio_file, min_silence_len=MIN_SILENCE_LENGTH, silence_thresh=SILENCE_DBFS_THRESHOLD, max_duration=MAX_DURATION_MS)
Detect boundaries (start/end times) based on silence detection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Path | Path to the audio file. | required |
| min_silence_len | int | Minimum silence length to consider for splitting (ms). | MIN_SILENCE_LENGTH |
| silence_thresh | int | Silence threshold in dBFS. | SILENCE_DBFS_THRESHOLD |
| max_duration | int | Maximum duration of any segment (ms). | MAX_DURATION_MS |

Returns:

| Type | Description |
|---|---|
| Tuple[List[Boundary], Dict] | A list of boundaries with empty text. |

Example

boundaries = detect_silence_boundaries(Path("my_audio.mp3"))
for b in boundaries:
    print(b.start, b.end)

Source code in src/tnh_scholar/audio_processing/audio_legacy.py (lines 96–151)
detect_whisper_boundaries(audio_file, model_size='tiny', language=None)
Detect sentence boundaries using a Whisper model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Path | Path to the audio file. | required |
| model_size | str | Whisper model size. | 'tiny' |
| language | str | Language to force for transcription (e.g. 'en', 'vi'), or None for auto. | None |

Returns:

| Type | Description |
|---|---|
| List[Boundary] | A list of sentence boundaries with text. |

Example

boundaries = detect_whisper_boundaries(Path("my_audio.mp3"), model_size="tiny")
for b in boundaries:
    print(b.start, b.end, b.text)

Source code in src/tnh_scholar/audio_processing/audio_legacy.py (lines 47–93)
split_audio(audio_file, method='whisper', output_dir=None, model_size='tiny', language=None, min_silence_len=MIN_SILENCE_LENGTH, silence_thresh=SILENCE_DBFS_THRESHOLD, max_duration=MAX_DURATION)
High-level function to split an audio file into chunks based on a chosen method.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Path | The input audio file. | required |
| method | str | Splitting method, "silence" or "whisper". | 'whisper' |
| output_dir | Path | Directory to store output. | None |
| model_size | str | Whisper model size if method='whisper'. | 'tiny' |
| language | str | Language for whisper transcription if method='whisper'. | None |
| min_silence_len | int | For silence-based detection, min silence length in ms. | MIN_SILENCE_LENGTH |
| silence_thresh | int | Silence threshold in dBFS. | SILENCE_DBFS_THRESHOLD |
| max_duration | int | Max chunk length in seconds (converted to ms for silence-based detection). | MAX_DURATION |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | Path | Directory containing the resulting chunks. |

Example

Split using silence detection:
split_audio(Path("my_audio.mp3"), method="silence")

Split using whisper-based sentence boundaries:
split_audio(Path("my_audio.mp3"), method="whisper", model_size="base", language="en")

Source code in src/tnh_scholar/audio_processing/audio_legacy.py (lines 238–297)
split_audio_at_boundaries(audio_file, boundaries, output_dir=None, max_duration=MAX_DURATION)
Split the audio file into chunks based on provided boundaries, ensuring all audio is included and boundaries align with the start of Whisper segments.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Path | The input audio file. | required |
| boundaries | List[Boundary] | Detected boundaries. | required |
| output_dir | Path | Directory to store the resulting chunks. | None |
| max_duration | int | Maximum chunk length in seconds. | MAX_DURATION |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | Path | Directory containing the chunked audio files. |

Example

boundaries = [Boundary(34.02, 37.26, "..."), Boundary(38.0, 41.18, "...")]
out_dir = split_audio_at_boundaries(Path("my_audio.mp3"), boundaries)

Source code in src/tnh_scholar/audio_processing/audio_legacy.py (lines 153–235)
whisper_model_transcribe(model, input_source, *args, **kwargs)
Wrapper around model.transcribe that suppresses the known 'FP16 is not supported on CPU; using FP32 instead' UserWarning and redirects unwanted 'OMP' messages to prevent interference.
This function accepts all args and kwargs that model.transcribe normally does, and supports input sources as file paths (str or Path) or in-memory audio arrays.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | Any | The Whisper model instance. | required |
| input_source | Union[str, Path, ndarray] | Input audio file path, URL, or in-memory audio array. | required |
| *args | | Additional positional arguments for model.transcribe. | () |
| **kwargs | | Additional keyword arguments for model.transcribe. | {} |

Returns:

| Type | Description |
|---|---|
| Dict[str, Any] | Transcription result from model.transcribe. |

Example

Using a file path:
result = whisper_model_transcribe(my_model, "sample_audio.mp3", verbose=True)

Using an audio array:
result = whisper_model_transcribe(my_model, audio_array, language="en")

Source code in src/tnh_scholar/audio_processing/audio_legacy.py (lines 393–455)
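The warning-suppression pattern this wrapper uses can be sketched with the standard warnings module. The FP16 message text matches what Whisper emits on CPU; the fake_transcribe stand-in replaces a real Whisper model so the example is self-contained (the OMP redirection is omitted here).

```python
import warnings

def wrapped_transcribe(transcribe_fn, *args, **kwargs):
    """Call transcribe_fn with the known FP16-on-CPU UserWarning silenced."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="FP16 is not supported on CPU; using FP32 instead",
            category=UserWarning,
        )
        return transcribe_fn(*args, **kwargs)

def fake_transcribe(path):
    # Stand-in for model.transcribe: emits the nuisance warning, returns a result.
    warnings.warn("FP16 is not supported on CPU; using FP32 instead", UserWarning)
    return {"text": "hello"}

result = wrapped_transcribe(fake_transcribe, "sample.mp3")
```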
diarization
__all__ = ['DiarizationProcessor', 'diarize', 'diarize_to_file', 'DiarizationParams', 'PyannoteClient', 'PyannoteConfig']
module-attribute
DiarizationParams
Bases: BaseModel
Per-request diarization options; maps to pyannote API payload. Use .to_api_dict() to emit API field names.
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py (lines 36–67)
confidence = Field(default=None, ge=0.0, le=1.0, description='Confidence threshold for segments.')
class-attribute
instance-attribute
model_config = ConfigDict(frozen=True, populate_by_name=True, extra='forbid')
class-attribute
instance-attribute
num_speakers = Field(default=None, alias='numSpeakers', description="Fixed number of speakers or 'auto' for detection.")
class-attribute
instance-attribute
webhook = Field(default=None, description='Webhook URL for job status callbacks.')
class-attribute
instance-attribute
to_api_dict()
Return payload dict using API field names (camelCase) and excluding Nones.
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py (lines 65–67)
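The camelCase-with-None-exclusion behavior of to_api_dict can be illustrated with a plain-dict sketch. The "numSpeakers" alias comes from the field definition above; the standalone function is an illustration, not the pydantic implementation.

```python
ALIASES = {"num_speakers": "numSpeakers"}  # alias documented on the model field

def to_api_dict(params):
    """Map snake_case params to API field names, excluding None values."""
    return {
        ALIASES.get(key, key): value
        for key, value in params.items()
        if value is not None
    }

payload = to_api_dict({"num_speakers": 2, "confidence": None, "webhook": None})
```

In the real model, pydantic's populate_by_name and per-field aliases produce the same shape when dumping by alias.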
DiarizationProcessor
Orchestrator over a DiarizationService.
This layer delegates to the service for generation and handles persistence.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 81–176)
audio_file_path = audio_file_path.resolve()
instance-attribute
output_path = output_path.resolve() if output_path is not None else self.audio_file_path.parent / f'{self.audio_file_path.stem}{PYANNOTE_FILE_STR}.json'
instance-attribute
params = params
instance-attribute
service = service or PyannoteService(default_client)
instance-attribute
writer = writer or FileResultWriter()
instance-attribute
__init__(audio_file_path, output_path=None, *, service=None, params=None, api_key=None, writer=None)
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 87–118)
export(response=None)
Write the provided or last response to self.output_path.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 169–176)
generate(*, wait_until_complete=True)
One-shot convenience: delegate to the service and cache the response.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 149–165)
get_response(job=None, *, wait_until_complete=False)
Fetch current/final response for a job, caching the last response.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 130–145)
start()
Start a job and cache its job_id.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 122–128)
PyannoteClient
Client for interacting with the pyannote.ai speaker diarization API.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 66–511)
api_key = api_key or os.getenv('PYANNOTEAI_API_TOKEN')
instance-attribute
config = config or PyannoteConfig()
instance-attribute
headers = {'Authorization': f'Bearer {self.api_key}'}
instance-attribute
network_timeout = self.config.network_timeout
instance-attribute
polling_config = self.config.polling_config
instance-attribute
upload_max_retries = self.config.upload_max_retries
instance-attribute
upload_timeout = self.config.upload_timeout
instance-attribute
JobPoller
Generic job polling helper for long-running async jobs.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 289–467)
job_id = job_id
instance-attribute
last_status = None
instance-attribute
poll_count = 0
instance-attribute
polling_config = polling_config
instance-attribute
start_time = time.time()
instance-attribute
status_fn = status_fn
instance-attribute
__init__(status_fn, job_id, polling_config)
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 294–301)
run()
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 420–447)
__init__(api_key=None, config=None)
Initialize with API key.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | Optional[str] | Pyannote.ai API key (defaults to environment variable) | None |

Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 69–91)
check_job_status(job_id)
Check the status of a diarization job.
Returns a typed transport model (JobStatusResponse) or None on failure.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 244–250)
poll_job_until_complete(job_id, estimated_duration=None, timeout=None, wait_until_complete=False)
Poll until the job reaches a terminal state or a client-side stop condition, and return a unified JobStatusResponse (JSR) that includes both the server payload and polling context via outcome, polls, and elapsed_s.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| job_id | str | Remote job identifier to poll. | required |
| estimated_duration | Optional[float] | Optional hint; currently unused (reserved for adaptive backoff). | None |
| timeout | Optional[float] | Optional hard timeout in seconds for this poll call. If provided, it overrides the client's default polling timeout. Ignored if wait_until_complete is True. | None |
| wait_until_complete | Optional[bool] | If True, ignore timeout and poll indefinitely (subject to process lifetime). | False |

Returns:

| Name | Type | Description |
|---|---|---|
| JobStatusResponse | JobStatusResponse | Unified transport + polling-context result. |

Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 469–511)
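The polling contract described above, stop on a terminal server status or on a client-side timeout, while tracking poll count, can be sketched as a simple loop. Status names and the result dict shape are illustrative; the real method returns a typed JobStatusResponse, and a production loop would sleep between polls.

```python
import time

TERMINAL = {"succeeded", "failed", "canceled"}  # illustrative terminal states

def poll_until_complete(status_fn, job_id, timeout=60.0, wait_until_complete=False):
    """Poll status_fn(job_id); stop on a terminal status or client timeout."""
    start, polls = time.monotonic(), 0
    while True:
        status = status_fn(job_id)
        polls += 1
        if status in TERMINAL:
            return {"status": status, "outcome": "terminal", "polls": polls}
        if not wait_until_complete and time.monotonic() - start > timeout:
            return {"status": status, "outcome": "timeout", "polls": polls}
        # A real implementation sleeps here between polls.

statuses = iter(["created", "running", "succeeded"])
result = poll_until_complete(lambda job_id: next(statuses), "job-123")
```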
start_diarization(media_id, params=None)
Start diarization job with pyannote.ai API.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| media_id | str | The media ID from upload_audio | required |
| params | Optional[DiarizationParams] | Optional parameters for diarization | None |

Returns:

| Type | Description |
|---|---|
| Optional[str] | The job ID if started successfully, None otherwise |

Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 207–225)
upload_audio(file_path)
Upload audio file with retry logic for network robustness.
Retries on network errors with exponential backoff. Fails fast on permanent errors (auth, file not found, etc.).
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 130–181)
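The retry behavior described here, exponential backoff on transient network errors, fail-fast on permanent ones, follows a generic pattern that can be sketched independently of the HTTP layer. PermanentError and flaky_upload are illustrative stand-ins; the delay is computed but not slept so the example runs instantly.

```python
class PermanentError(Exception):
    """Errors that should never be retried (auth failure, missing file)."""

def retry_with_backoff(operation, max_retries=3, base_delay=1.0):
    """Run operation(); retry transient exceptions with doubling delays."""
    for attempt in range(max_retries):
        try:
            return operation()
        except PermanentError:
            raise  # fail fast: retrying cannot help
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            # time.sleep(delay) would go here in real code

attempts = []
def flaky_upload():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient network error")
    return "media://diarization-abc"  # illustrative media ID

media_id = retry_with_backoff(flaky_upload)
```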
PyannoteConfig
Bases: BaseSettings
Configuration constants for Pyannote API.
Source code in src/tnh_scholar/audio_processing/diarization/config.py (lines 22–59)
base_url = 'https://api.pyannote.ai/v1'
class-attribute
instance-attribute
diarize_endpoint
property
job_status_endpoint
property
media_content_type = 'audio/mpeg'
class-attribute
instance-attribute
media_input_endpoint
property
media_prefix = 'media://diarization-'
class-attribute
instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='PYANNOTE_', extra='ignore')
class-attribute
instance-attribute
network_timeout = 3
class-attribute
instance-attribute
polling_config = PollingConfig()
class-attribute
instance-attribute
upload_max_retries = 3
class-attribute
instance-attribute
upload_timeout = 300
class-attribute
instance-attribute
diarize(audio_file_path, output_path=None, *, params=None, service=None, api_key=None, wait_until_complete=True)
One-shot convenience to generate a result and (optionally) write it.
This returns the DiarizationResponse. Writing is left to callers or
diarize_to_file below.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 181–202)
diarize_to_file(audio_file_path, output_path=None, *, params=None, service=None, api_key=None, wait_until_complete=True)
Convenience helper: generate, then export to JSON if successful; returns the response.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 205–225)
audio
__all__ = ['AudioHandler', 'AudioHandlerConfig']
module-attribute
AudioHandler
Isolates audio operations and external dependencies (pydub, ffmpeg).
Source code in src/tnh_scholar/audio_processing/diarization/audio/handler.py (lines 32–177)
base_audio
instance-attribute
config = config
instance-attribute
input_format = None
instance-attribute
output_format = config.output_format
instance-attribute
__init__(config=AudioHandlerConfig())
Source code in src/tnh_scholar/audio_processing/diarization/audio/handler.py (lines 35–43)
build_audio_chunk(chunk, audio_file)
Builds and sets the internal chunk.audio to the new AudioChunk.
Source code in src/tnh_scholar/audio_processing/diarization/audio/handler.py (lines 45–60)
export_audio_bytes(audio_segment, format_str=None)
Export AudioSegment to BytesIO for services/modules that require file-like objects.
Source code in src/tnh_scholar/audio_processing/diarization/audio/handler.py (lines 62–64)
AudioHandlerConfig
Bases: BaseSettings
Configuration settings for the AudioHandler. All audio time units are milliseconds (int)
Source code in src/tnh_scholar/audio_processing/diarization/audio/config.py (lines 8–35)
SUPPORTED_FORMATS = frozenset({'mp3', 'wav', 'flac', 'ogg', 'm4a', 'mp4'})
class-attribute
instance-attribute
max_segment_length = Field(default=None, description='Maximum allowed segment length (in milliseconds).')
class-attribute
instance-attribute
output_format = Field(default=None, description="Audio output format used when exporting segments (e.g., 'wav', 'mp3').")
class-attribute
instance-attribute
silence_all_intervals = Field(default=False, description='If True, replace every non-zero interval between consecutive diarization segments with silence of length spacing_time.')
class-attribute
instance-attribute
temp_storage_dir = Field(default=None, description='Optional directory path for storing temporary audio files (currently unused).')
class-attribute
instance-attribute
Config
Source code in src/tnh_scholar/audio_processing/diarization/audio/config.py (lines 34–35)
env_prefix = 'AUDIO_HANDLER_'
class-attribute
instance-attribute
config
AudioHandlerConfig
Bases: BaseSettings
Configuration settings for the AudioHandler. All audio time units are milliseconds (int)
Source code in src/tnh_scholar/audio_processing/diarization/audio/config.py (lines 8–35)
SUPPORTED_FORMATS = frozenset({'mp3', 'wav', 'flac', 'ogg', 'm4a', 'mp4'})
class-attribute
instance-attribute
max_segment_length = Field(default=None, description='Maximum allowed segment length (in milliseconds).')
class-attribute
instance-attribute
output_format = Field(default=None, description="Audio output format used when exporting segments (e.g., 'wav', 'mp3').")
class-attribute
instance-attribute
silence_all_intervals = Field(default=False, description='If True, replace every non-zero interval between consecutive diarization segments with silence of length spacing_time.')
class-attribute
instance-attribute
temp_storage_dir = Field(default=None, description='Optional directory path for storing temporary audio files (currently unused).')
class-attribute
instance-attribute
Config
Source code in src/tnh_scholar/audio_processing/diarization/audio/config.py (lines 34–35)
env_prefix = 'AUDIO_HANDLER_'
class-attribute
instance-attribute
handler
Audio handler utilities for slicing and assembling audio around diarization chunks. Designed for pipeline-friendly, single-responsibility methods so that higher-level services can remain agnostic of the underlying audio library.
This implementation purposely keeps logic minimal for testing.
logger = get_child_logger(__name__)
module-attribute
AudioHandler
Isolates audio operations and external dependencies (pydub, ffmpeg).
Source code in src/tnh_scholar/audio_processing/diarization/audio/handler.py (lines 32–177)
base_audio
instance-attribute
config = config
instance-attribute
input_format = None
instance-attribute
output_format = config.output_format
instance-attribute
__init__(config=AudioHandlerConfig())
Source code in src/tnh_scholar/audio_processing/diarization/audio/handler.py (lines 35-43)
build_audio_chunk(chunk, audio_file)
Builds and sets the internal chunk.audio to be the new AudioChunk.
Source code in src/tnh_scholar/audio_processing/diarization/audio/handler.py (lines 45-60)
export_audio_bytes(audio_segment, format_str=None)
Export AudioSegment to BytesIO for services/modules that require file-like objects.
Source code in src/tnh_scholar/audio_processing/diarization/audio/handler.py (lines 62-64)
chunker
logger = get_child_logger(__name__)
module-attribute
DiarizationChunker
Class for chunking diarization results into processing units based on configurable duration targets.
Source code in src/tnh_scholar/audio_processing/diarization/chunker.py (lines 14-166)
config = ChunkConfig()
instance-attribute
__init__(**config_options)
Initialize chunker with additional config_options.
Source code in src/tnh_scholar/audio_processing/diarization/chunker.py (lines 20-24)
extract_contiguous_chunks(segments)
Split diarization segments into contiguous chunks of approximately target_duration, without splitting on speaker changes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| segments | List[DiarizedSegment] | List of speaker segments from diarization | required |

Returns:

| Type | Description |
|---|---|
| List[DiarizationChunk] | List[Chunk]: Flat list of contiguous chunks |
Source code in src/tnh_scholar/audio_processing/diarization/chunker.py (lines 27-42)
config
ChunkConfig
Bases: BaseSettings
Configuration for chunking
Source code in src/tnh_scholar/audio_processing/diarization/config.py (lines 81-101)
gap_spacing_time = 1000
class-attribute
instance-attribute
gap_threshold = 4000
class-attribute
instance-attribute
min_duration = 30000
class-attribute
instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='CHUNK_', extra='ignore')
class-attribute
instance-attribute
target_duration = 300000
class-attribute
instance-attribute
DiarizationConfig
Bases: BaseSettings
Source code in src/tnh_scholar/audio_processing/diarization/config.py (lines 148-159)
chunk = ChunkConfig()
class-attribute
instance-attribute
language = LanguageConfig()
class-attribute
instance-attribute
mapping = MappingPolicy()
class-attribute
instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='DIARIZATION_', extra='ignore')
class-attribute
instance-attribute
speaker = SpeakerConfig()
class-attribute
instance-attribute
LanguageConfig
Bases: BaseSettings
Source code in src/tnh_scholar/audio_processing/diarization/config.py (lines 104-119)
default_language = 'en'
class-attribute
instance-attribute
export_format = 'wav'
class-attribute
instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='LANGUAGE_', extra='ignore')
class-attribute
instance-attribute
probe_time = 10000
class-attribute
instance-attribute
MappingPolicy
Bases: BaseSettings
Mapping policy for transport→domain shaping.
TODO (future parameters to consider):
- min_segment_ms: int # drop micro-segments below threshold
- merge_gap_ms: int # merge adjacent same-speaker if gap ≤ this
- round_ms_to: int # quantize boundaries (e.g., 10ms)
- confidence_floor: float | None # filter out low-confidence segments
- suppress_unlabeled: bool # drop segments missing speaker id
- attach_raw_payload: bool # persist raw API payload in metadata
- version: int # policy versioning for reproducibility
Source code in src/tnh_scholar/audio_processing/diarization/config.py (lines 123-145)
default_speaker_label = 'SPEAKER_00'
class-attribute
instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='MAPPING_', extra='ignore')
class-attribute
instance-attribute
single_speaker = False
class-attribute
instance-attribute
PollingConfig
Bases: BaseSettings
Configuration constants for a generic polling class used for Pyannote API polling.
Source code in src/tnh_scholar/audio_processing/diarization/config.py (lines 6-20)
exp_base = 2
class-attribute
instance-attribute
initial_poll_time = 7
class-attribute
instance-attribute
max_interval = 30
class-attribute
instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='PYANNOTE_POLL_', extra='ignore')
class-attribute
instance-attribute
polling_interval = 15
class-attribute
instance-attribute
polling_timeout = 300.0
class-attribute
instance-attribute
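These constants (initial_poll_time, exp_base, max_interval, polling_timeout) suggest an exponential backoff capped at max_interval. The exact schedule is internal to JobPoller, so the following is only an illustrative sketch of how such a schedule could be derived from the defaults above:

```python
def poll_schedule(initial: float = 7, exp_base: float = 2,
                  max_interval: float = 30, timeout: float = 300.0) -> list:
    """Illustrative exponential-backoff schedule: each wait grows by
    exp_base, is capped at max_interval, and the series stops once the
    next poll would exceed the overall timeout."""
    waits, elapsed, wait = [], 0.0, initial
    while elapsed + wait <= timeout:
        waits.append(wait)
        elapsed += wait
        wait = min(wait * exp_base, max_interval)
    return waits
```

With the defaults this produces waits of 7, 14, 28, then 30 seconds repeated until the 300-second budget is spent.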
PyannoteConfig
Bases: BaseSettings
Configuration constants for Pyannote API.
Source code in src/tnh_scholar/audio_processing/diarization/config.py (lines 22-59)
base_url = 'https://api.pyannote.ai/v1'
class-attribute
instance-attribute
diarize_endpoint
property
job_status_endpoint
property
media_content_type = 'audio/mpeg'
class-attribute
instance-attribute
media_input_endpoint
property
media_prefix = 'media://diarization-'
class-attribute
instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='PYANNOTE_', extra='ignore')
class-attribute
instance-attribute
network_timeout = 3
class-attribute
instance-attribute
polling_config = PollingConfig()
class-attribute
instance-attribute
upload_max_retries = 3
class-attribute
instance-attribute
upload_timeout = 300
class-attribute
instance-attribute
SpeakerConfig
Bases: BaseSettings
Configuration settings for speaker block generation.
Source code in src/tnh_scholar/audio_processing/diarization/config.py (lines 61-78)
default_speaker_label = 'SPEAKER_00'
class-attribute
instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='SPEAKER_', extra='ignore')
class-attribute
instance-attribute
same_speaker_gap_threshold = TimeMs.from_seconds(2)
class-attribute
instance-attribute
single_speaker = False
class-attribute
instance-attribute
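The same_speaker_gap_threshold (2 seconds) governs when consecutive segments by one speaker are merged into a single speaker block. A standalone sketch of that grouping rule, under the assumption that merging is a simple gap comparison (the real block builder may apply additional policy):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Seg:
    speaker: str
    start: int  # ms
    end: int    # ms

def group_speaker_blocks(segments: List[Seg], gap_threshold: int = 2000) -> List[List[Seg]]:
    """Merge consecutive same-speaker segments whose inter-segment gap
    is at most gap_threshold ms (illustrative sketch)."""
    blocks: List[List[Seg]] = []
    for seg in segments:
        if (blocks
                and blocks[-1][-1].speaker == seg.speaker
                and seg.start - blocks[-1][-1].end <= gap_threshold):
            blocks[-1].append(seg)  # continue the current block
        else:
            blocks.append([seg])    # speaker change or large gap: new block
    return blocks
```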
models
logger = get_child_logger(__name__)
module-attribute
AudioChunk
Bases: BaseModel
Source code in src/tnh_scholar/audio_processing/diarization/models.py (lines 176-185)
channels = None
class-attribute
instance-attribute
data
instance-attribute
end_ms
instance-attribute
format = None
class-attribute
instance-attribute
sample_rate = None
class-attribute
instance-attribute
start_ms
instance-attribute
Config
Source code in src/tnh_scholar/audio_processing/diarization/models.py (lines 184-185)
arbitrary_types_allowed = True
class-attribute
instance-attribute
AugDiarizedSegment
Bases: DiarizedSegment
DiarizedSegment with additional chunking/processing metadata.
This class extends DiarizationSegment and adds fields that are only set during
chunk accumulation or downstream processing.
Attributes:

| Name | Type | Description |
|---|---|---|
| gap_before | bool | Indicates if there is a gap greater than the configured threshold before this segment. Set only during chunk accumulation. |
| spacing_time | TimeMs | The spacing (in ms) between this and the previous segment, possibly adjusted if there is a gap before. Set only during chunk accumulation. |
| audio | TNHAudioSegment | The audio data for this segment, sliced from the original audio. |
Notes
- The `audio` field is a slice of the original audio corresponding to this segment.
- All time values (start, end, duration) are relative to the original audio.
- When slicing or probing the `audio` field, use times relative to 0 (i.e., 0 to duration).
- For language probing or any operation on `audio`, always use 0 as the start and `duration` as the end.
Source code in src/tnh_scholar/audio_processing/diarization/models.py (lines 103-173)
audio
instance-attribute
gap_before_new
instance-attribute
relative_end
property
End time relative to the segment audio (duration of segment).
relative_start
property
Start time relative to the segment audio (always 0).
spacing_time_new
instance-attribute
Config
Source code in src/tnh_scholar/audio_processing/diarization/models.py (lines 172-173)
arbitrary_types_allowed = True
class-attribute
instance-attribute
from_segment(segment, gap_before=None, spacing_time_new=None, audio=None, **kwargs)
classmethod
Create an AugDiarizedSegment from a DiarizedSegment, with optional new fields.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| segment | DiarizedSegment | The base segment to copy fields from. | required |
| gap_before_new | bool, optional | Value for gap_before_new. | False |
| spacing_time_new | TimeMs, optional | Value for spacing_time_new. | None |
| audio | AudioSegment, optional | Audio data for this segment. | None |
| **kwargs | | Any additional fields to override. | |

Returns:

| Type | Description |
|---|---|
| AugDiarizedSegment | The new augmented segment. |
Source code in src/tnh_scholar/audio_processing/diarization/models.py (lines 139-170)
DiarizationChunk
Bases: BaseModel
Represents a chunk of segments to be processed together.
Source code in src/tnh_scholar/audio_processing/diarization/models.py
188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 | |
accumulated_time = 0
class-attribute
instance-attribute
audio = None
class-attribute
instance-attribute
end_time
instance-attribute
segments
instance-attribute
start_time
instance-attribute
total_duration
property
Get chunk duration in milliseconds.
total_duration_sec
property
total_duration_time
property
Config
Source code in src/tnh_scholar/audio_processing/diarization/models.py (lines 195-196)
arbitrary_types_allowed = True
class-attribute
instance-attribute
DiarizedSegment
Bases: BaseModel
Represents a diarized audio segment for a single speaker.
Attributes:

| Name | Type | Description |
|---|---|---|
| speaker | str | The speaker label for this segment. |
| start | TimeMs | Start time in milliseconds. |
| end | TimeMs | End time in milliseconds. |
| audio_map_start | Optional[int] | Location in the audio output file, if mapped. |
| gap_before | Optional[bool] | Indicates if there is a gap greater than the configured threshold before this segment. Set only during chunk accumulation. |
| spacing_time | Optional[int] | The spacing (in ms) between this and the previous segment, possibly adjusted if there is a gap before. Set only during chunk accumulation. |
Notes
- `gap_before` and `spacing_time` are not set during initial diarization, but are assigned only when the segment is accumulated into a chunk for downstream audio handling.
- These fields should be considered write-once and must not be mutated elsewhere.
Source code in src/tnh_scholar/audio_processing/diarization/models.py (lines 14-101)
audio_map_start
instance-attribute
duration
property
Get segment duration in milliseconds.
duration_sec
property
end
instance-attribute
end_time
property
gap_before
instance-attribute
mapped_end
property
mapped_start
property
Downstream registry field set by the audio handler
spacing_time
instance-attribute
speaker
instance-attribute
start
instance-attribute
start_time
property
normalize()
Normalize the duration of the segment to be nonzero and validate start/end values.
Source code in src/tnh_scholar/audio_processing/diarization/models.py (lines 77-101)
SpeakerBlock
Bases: BaseModel
A block of contiguous or near-contiguous segments spoken by the same speaker.
Used as a higher-level abstraction over diarization segments to simplify chunking strategies (e.g., language-aware sampling, re-segmentation).
Source code in src/tnh_scholar/audio_processing/diarization/models.py (lines 211-319)
duration
property
duration_sec
property
end
property
segment_count
property
segments
instance-attribute
speaker
instance-attribute
start
property
Config
Source code in src/tnh_scholar/audio_processing/diarization/models.py (lines 221-222)
arbitrary_types_allowed = True
class-attribute
instance-attribute
from_dict(data)
classmethod
Create a SpeakerBlock from a dictionary (output of to_dict).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | dict | Dictionary with keys matching SpeakerBlock fields. | required |

Returns:

| Type | Description |
|---|---|
| SpeakerBlock | Deserialized SpeakerBlock instance. |

Raises:

| Type | Description |
|---|---|
| ValueError, TypeError | If validation fails. |
Source code in src/tnh_scholar/audio_processing/diarization/models.py (lines 282-319)
to_dict()
Custom serializer for SpeakerBlock with validation.
Source code in src/tnh_scholar/audio_processing/diarization/models.py (lines 244-280)
protocols
Interfaces shared by diarization strategy classes.
AudioFetcher
Bases: Protocol
Abstract audio provider for probing a segment.
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (lines 32-35)
extract_audio(start_ms, end_ms)
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (line 35)
ChunkingStrategy
Bases: Protocol
Protocol every chunking strategy must satisfy.
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (lines 24-29)
extract(segments)
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (line 29)
DiarizationService
Bases: Protocol
Protocol for any diarization service.
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (lines 43-72)
generate(audio_path, params=None, *, wait_until_complete=True)
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (lines 60-72)
get_response(job_id, *, wait_until_complete=False)
Return the current state or final result as a DiarizationResponse.
When wait_until_complete is True, the service blocks until a terminal
state (succeeded/failed/timeout) and returns that envelope.
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (lines 52-58)
start(audio_path, params=None)
Start a diarization job and return an opaque job_id.
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (lines 47-49)
LanguageDetector
Bases: Protocol
Abstract language detector (e.g., fastText, Whisper-lang).
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (lines 38-41)
detect(audio, format_str)
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (line 41)
ResultWriter
Bases: Protocol
Port for persisting diarization results.
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (lines 74-78)
write(path, response)
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (lines 77-78)
SegmentAdapter
Bases: Protocol
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (lines 17-21)
to_segments(data)
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py (lines 18-21)
pyannote_adapter
logger = get_child_logger(__name__)
module-attribute
PyannoteAdapter
Bases: SegmentAdapter
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_adapter.py (lines 26-230)
config = config
instance-attribute
__init__(config=DiarizationConfig())
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_adapter.py (lines 27-28)
failed_start()
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_adapter.py (lines 220-230)
to_response(jsr)
Convert a JobStatusResponse to a DiarizationResponse (domain layer).
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_adapter.py (lines 49-65)
to_segments(data)
Convert a pyannoteai diarization result dict to list of DiarizationSegment objects.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_adapter.py (lines 30-47)
pyannote_client
pyannote_client.py
Client interface for interacting with the pyannote.ai speaker diarization API.
This module provides a robust, object-oriented client for uploading audio files, starting diarization jobs, polling for job completion, and retrieving results from the pyannote.ai API. It includes retry logic, configurable timeouts, and support for advanced diarization parameters.
Typical usage
```python
client = PyannoteClient(api_key="your_api_key")
media_id = client.upload_audio(Path("audio.mp3"))
job_id = client.start_diarization(media_id)
result = client.poll_job_until_complete(job_id)
```
JOB_ID_FIELD = 'jobId'
module-attribute
logger = get_child_logger(__name__)
module-attribute
APIKeyError
Bases: Exception
Raised when API key is missing or invalid.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 53-54)
PyannoteClient
Client for interacting with the pyannote.ai speaker diarization API.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 66-511)
api_key = api_key or os.getenv('PYANNOTEAI_API_TOKEN')
instance-attribute
config = config or PyannoteConfig()
instance-attribute
headers = {'Authorization': f'Bearer {self.api_key}'}
instance-attribute
network_timeout = self.config.network_timeout
instance-attribute
polling_config = self.config.polling_config
instance-attribute
upload_max_retries = self.config.upload_max_retries
instance-attribute
upload_timeout = self.config.upload_timeout
instance-attribute
JobPoller
Generic job polling helper for long-running async jobs.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 289-467)
job_id = job_id
instance-attribute
last_status = None
instance-attribute
poll_count = 0
instance-attribute
polling_config = polling_config
instance-attribute
start_time = time.time()
instance-attribute
status_fn = status_fn
instance-attribute
__init__(status_fn, job_id, polling_config)
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 294-301)
run()
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 420-447)
__init__(api_key=None, config=None)
Initialize with API key.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | Optional[str] | Pyannote.ai API key (defaults to environment variable) | None |
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 69-91)
check_job_status(job_id)
Check the status of a diarization job.
Returns a typed transport model (JobStatusResponse) or None on failure.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 244-250)
poll_job_until_complete(job_id, estimated_duration=None, timeout=None, wait_until_complete=False)
Poll until the job reaches a terminal state or a client-side stop condition, and
return a unified JobStatusResponse (JSR) that includes both the server payload
and polling context via outcome, polls, and elapsed_s.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| job_id | str | Remote job identifier to poll. | required |
| estimated_duration | Optional[float] | Optional hint; currently unused (reserved for adaptive backoff). | None |
| timeout | Optional[float] | Optional hard timeout in seconds for this poll call. If provided, it overrides the client's default polling timeout. Ignored if `wait_until_complete` is True. | None |
| wait_until_complete | Optional[bool] | If True, ignore timeout and poll indefinitely (subject to process lifetime). | False |

Returns:

| Name | Type | Description |
|---|---|---|
| JobStatusResponse | JobStatusResponse | Unified transport + polling-context result. |
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 469-511)
start_diarization(media_id, params=None)
Start diarization job with pyannote.ai API.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| media_id | str | The media ID from upload_audio | required |
| params | Optional[DiarizationParams] | Optional parameters for diarization | None |

Returns:

| Type | Description |
|---|---|
| Optional[str] | Optional[str]: The job ID if started successfully, None otherwise |
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 207-225)
upload_audio(file_path)
Upload audio file with retry logic for network robustness.
Retries on network errors with exponential backoff. Fails fast on permanent errors (auth, file not found, etc.).
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_client.py (lines 130-181)
pyannote_diarize
PYANNOTE_FILE_STR = '_pyannote_diarization'
module-attribute
logger = get_child_logger(__name__)
module-attribute
DiarizationProcessor
Orchestrator over a DiarizationService.
This layer delegates to the service for generation and handles persistence.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 81-176)
audio_file_path = audio_file_path.resolve()
instance-attribute
output_path = output_path.resolve() if output_path is not None else self.audio_file_path.parent / f'{self.audio_file_path.stem}{PYANNOTE_FILE_STR}.json'
instance-attribute
params = params
instance-attribute
service = service or PyannoteService(default_client)
instance-attribute
writer = writer or FileResultWriter()
instance-attribute
__init__(audio_file_path, output_path=None, *, service=None, params=None, api_key=None, writer=None)
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 87-118)
export(response=None)
Write the provided or last response to self.output_path.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 169-176)
generate(*, wait_until_complete=True)
One-shot convenience: delegate to the service and cache the response.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 149-165)
get_response(job=None, *, wait_until_complete=False)
Fetch current/final response for a job, caching the last response.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 130-145)
start()
Start a job and cache its job_id.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 122-128)
FileResultWriter
Default file-system writer to JSON.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 30-37)
write(path, response)
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 33-37)
PyannoteService
Bases: DiarizationService
Concrete implementation of DiarizationService for pyannote.ai.
Bridges transport (PyannoteClient) and mapping (PyannoteAdapter) while exposing a clean domain-facing API.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 40-75)
adapter = adapter or PyannoteAdapter()
instance-attribute
client = client or PyannoteClient()
instance-attribute
__init__(client=None, adapter=None)
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 47-49)
generate(audio_path, params=None, *, wait_until_complete=True)
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 66-75)
get_response(job_id, *, wait_until_complete=False)
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 59-64)
start(audio_path, params=None)
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 52-57)
diarize(audio_file_path, output_path=None, *, params=None, service=None, api_key=None, wait_until_complete=True)
One-shot convenience to generate a result and (optionally) write it.
This returns the DiarizationResponse. Writing is left to callers or
diarize_to_file below.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 181-202)
diarize_to_file(audio_file_path, output_path=None, *, params=None, service=None, api_key=None, wait_until_complete=True)
Convenience helper: generate, then export to JSON if successful; returns the response.
Source code in src/tnh_scholar/audio_processing/diarization/pyannote_diarize.py (lines 205-225)
schemas
DiarizationResponse = Annotated[Union[DiarizationSucceeded, DiarizationFailed, DiarizationPending, DiarizationRunning], Field(discriminator='status')]
module-attribute
__all__ = ['PollOutcome', 'DiarizationParams', 'StartDiarizationResponse', 'JobStatus', 'JobStatusResponse', 'ErrorCode', 'ErrorInfo', 'DiarizationResult', 'DiarizationSucceeded', 'DiarizationFailed', 'DiarizationPending', 'DiarizationRunning', 'DiarizationResponse']
module-attribute
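The DiarizationResponse union above is discriminated on the status field, so a payload parses into exactly one of the four variants. A minimal stdlib sketch of that dispatch; the dataclasses are illustrative stand-ins for the Pydantic models.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Succeeded:
    result: Optional[dict] = None
    status: str = "succeeded"

@dataclass(frozen=True)
class Failed:
    error: Optional[str] = None
    status: str = "failed"

@dataclass(frozen=True)
class Pending:
    status: str = "pending"

@dataclass(frozen=True)
class Running:
    status: str = "running"

ResponseSketch = Union[Succeeded, Failed, Pending, Running]

# Map each discriminator value to its variant, as Field(discriminator='status') does.
_VARIANTS = {"succeeded": Succeeded, "failed": Failed,
             "pending": Pending, "running": Running}

def parse_response(payload: dict) -> ResponseSketch:
    """Select the variant whose discriminator matches payload['status']."""
    cls = _VARIANTS[payload["status"]]
    data = {k: v for k, v in payload.items() if k in cls.__dataclass_fields__}
    return cls(**data)
```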
DiarizationFailed
Bases: _BaseResponse
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 168–170.
error
instance-attribute
status
instance-attribute
DiarizationParams
Bases: BaseModel
Per-request diarization options; maps to pyannote API payload. Use .to_api_dict() to emit API field names.
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 36–67.
confidence = Field(default=None, ge=0.0, le=1.0, description='Confidence threshold for segments.')
class-attribute
instance-attribute
model_config = ConfigDict(frozen=True, populate_by_name=True, extra='forbid')
class-attribute
instance-attribute
num_speakers = Field(default=None, alias='numSpeakers', description="Fixed number of speakers or 'auto' for detection.")
class-attribute
instance-attribute
webhook = Field(default=None, description='Webhook URL for job status callbacks.')
class-attribute
instance-attribute
to_api_dict()
Return payload dict using API field names (camelCase) and excluding Nones.
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 65–67.
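A sketch of the documented to_api_dict behavior, emitting camelCase field names and excluding None values. This is a stdlib re-implementation for illustration, not the Pydantic model itself.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union

@dataclass(frozen=True)
class ParamsSketch:
    # Mirrors DiarizationParams: snake_case attributes, camelCase on the wire.
    num_speakers: Union[int, str, None] = None  # fixed count or 'auto'
    confidence: Optional[float] = None          # 0.0..1.0 threshold
    webhook: Optional[str] = None               # status-callback URL

    _ALIASES = {"num_speakers": "numSpeakers"}  # only num_speakers is aliased

    def to_api_dict(self) -> dict:
        out: dict[str, Any] = {}
        for name in ("num_speakers", "confidence", "webhook"):
            value = getattr(self, name)
            if value is not None:
                out[self._ALIASES.get(name, name)] = value
        return out
```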
DiarizationPending
Bases: _BaseResponse
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 173–174.
status
instance-attribute
DiarizationResult
Bases: BaseModel
Domain-level diarization payload used by the rest of the system.
NOTE: segments is intentionally typed as list[Any] so that it can
hold your project’s DiarizedSegment instances from models.py without
creating an import cycle. You can tighten this typing later to
list[DiarizedSegment] and import under TYPE_CHECKING if desired.
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 135–148.
metadata = None
class-attribute
instance-attribute
model_config = ConfigDict(frozen=True, extra='ignore')
class-attribute
instance-attribute
num_speakers = None
class-attribute
instance-attribute
segments
instance-attribute
DiarizationRunning
Bases: _BaseResponse
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 177–178.
status
instance-attribute
DiarizationSucceeded
Bases: _BaseResponse
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 163–165.
result
instance-attribute
status
instance-attribute
ErrorCode
Bases: str, Enum
Client- and adapter-level error taxonomy (not server statuses).
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 115–124.
API_ERROR = 'api_error'
class-attribute
instance-attribute
BAD_REQUEST = 'bad_request'
class-attribute
instance-attribute
CANCELLED = 'cancelled'
class-attribute
instance-attribute
PARSE_ERROR = 'parse_error'
class-attribute
instance-attribute
TIMEOUT = 'timeout'
class-attribute
instance-attribute
TRANSIENT = 'transient'
class-attribute
instance-attribute
UNKNOWN = 'unknown'
class-attribute
instance-attribute
ErrorInfo
Bases: BaseModel
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 127–132.
code
instance-attribute
details = None
class-attribute
instance-attribute
message
instance-attribute
model_config = ConfigDict(frozen=True, extra='allow')
class-attribute
instance-attribute
JobHandle
dataclass
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 30–33.
backend = 'pyannote'
class-attribute
instance-attribute
job_id
instance-attribute
__init__(job_id, backend='pyannote')
JobStatus
Bases: str, Enum
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 23–27.
FAILED = 'failed'
class-attribute
instance-attribute
PENDING = 'pending'
class-attribute
instance-attribute
RUNNING = 'running'
class-attribute
instance-attribute
SUCCEEDED = 'succeeded'
class-attribute
instance-attribute
JobStatusResponse
Bases: BaseModel
Job Status Result (JSR): unified transport payload + client polling context. Combines transport-level fields with client-side polling metadata.
Semantics:
- outcome describes how polling concluded (terminal success/failure, timeout, network error, etc.).
- status is the last known server job status (SUCCEEDED, FAILED, RUNNING, PENDING).
- server_error_msg and payload mirror the remote payload when present.
- polls and elapsed_s report client polling metrics.
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 80–109.
elapsed_s = 0.0
class-attribute
instance-attribute
job_id
instance-attribute
model_config = ConfigDict(frozen=True, extra='ignore')
class-attribute
instance-attribute
outcome
instance-attribute
payload = None
class-attribute
instance-attribute
polls = 0
class-attribute
instance-attribute
server_error_msg = None
class-attribute
instance-attribute
status = None
class-attribute
instance-attribute
PollOutcome
Bases: str, Enum
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 14–20.
ERROR = 'error'
class-attribute
instance-attribute
FAILED = 'failed'
class-attribute
instance-attribute
INTERRUPTED = 'interrupted'
class-attribute
instance-attribute
NETWORK_ERROR = 'network_error'
class-attribute
instance-attribute
SUCCEEDED = 'succeeded'
class-attribute
instance-attribute
TIMEOUT = 'timeout'
class-attribute
instance-attribute
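JobStatusResponse distinguishes the polling outcome (how the loop concluded) from the last server status. A stdlib sketch of such a loop, with fetch_status as a stand-in for a client call; the function and its return shape are illustrative, not the library's API.

```python
import time
from typing import Callable, Tuple

TERMINAL = {"succeeded", "failed"}

def poll_job(fetch_status: Callable[[], str], timeout_s: float = 5.0,
             interval_s: float = 0.0) -> Tuple[str, str, int]:
    """Return (outcome, last_status, polls). outcome is 'succeeded', 'failed',
    or 'timeout'; last_status is the last server status seen; polls counts
    the client-side status fetches, as the JSR metrics do."""
    deadline = time.monotonic() + timeout_s
    polls = 0
    status = "pending"
    while time.monotonic() < deadline:
        status = fetch_status()
        polls += 1
        if status in TERMINAL:
            return status, status, polls
        if interval_s:
            time.sleep(interval_s)
    return "timeout", status, polls
```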
StartDiarizationResponse
Bases: BaseModel
Minimal typed view of the start-diarization response.
Source code in src/tnh_scholar/audio_processing/diarization/schemas.py, lines 70–77.
job_id = Field(alias='jobId')
class-attribute
instance-attribute
model_config = ConfigDict(frozen=True, extra='ignore')
class-attribute
instance-attribute
strategies
__all__ = ['LanguageDetector', 'LanguageProbe', 'WhisperLanguageDetector', 'group_speaker_blocks', 'TimeGapChunker']
module-attribute
LanguageDetector
Bases: Protocol
Abstract language detector (e.g., fastText, Whisper-lang).
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py, lines 38–41.
detect(audio, format_str)
Source code in src/tnh_scholar/audio_processing/diarization/protocols.py, line 41.
LanguageProbe
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_probe.py, lines 48–103.
detector = detector
instance-attribute
export_format = config.language.export_format
instance-attribute
probe_time = config.language.probe_time
instance-attribute
__init__(config, detector)
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_probe.py, lines 49–52.
segment_language(aug_segment)
Get the segment's ISO-639 language code from an Augmented Diarize Segment, which contains audio.
The probe window is always relative to the segment audio (0=start, duration=end).
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_probe.py, lines 54–75.
TimeGapChunker
Bases: ChunkingStrategy
Chunker that ignores speaker/language and uses only time-gap logic.
Source code in src/tnh_scholar/audio_processing/diarization/strategies/time_gap.py, lines 21–69.
cfg = config
instance-attribute
__init__(config=DiarizationConfig())
Source code in src/tnh_scholar/audio_processing/diarization/strategies/time_gap.py, lines 24–25.
extract(segments)
Extract time-based chunks from diarization segments.
Source code in src/tnh_scholar/audio_processing/diarization/strategies/time_gap.py, lines 27–42.
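The time-gap logic can be sketched over plain (start, end) tuples; the threshold names target_s and max_gap_s are illustrative, not the actual DiarizationConfig fields.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s)

def time_gap_chunks(segments: List[Segment], target_s: float = 60.0,
                    max_gap_s: float = 5.0) -> List[List[Segment]]:
    """Split start-sorted segments into chunks, cutting when the accumulated
    span would exceed target_s or the inter-segment gap exceeds max_gap_s."""
    chunks: List[List[Segment]] = []
    for seg in segments:
        if chunks:
            cur = chunks[-1]
            span = seg[1] - cur[0][0]   # total duration if seg joins the chunk
            gap = seg[0] - cur[-1][1]   # silence since the previous segment
            if span <= target_s and gap <= max_gap_s:
                cur.append(seg)
                continue
        chunks.append([seg])
    return chunks
```

Note that speaker and language play no role here; that is exactly what distinguishes this baseline from LanguageChunker.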
WhisperLanguageDetector
Language detector using Whisper service.
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_probe.py, lines 23–45.
audio_handler = audio_handler or AudioHandler()
instance-attribute
model = model
instance-attribute
__init__(model='whisper-1', audio_handler=None)
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_probe.py, lines 26–28.
detect(audio, format_str)
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_probe.py, lines 30–41.
group_speaker_blocks(segments, config=DiarizationConfig())
Group contiguous or near-contiguous segments by speaker identity.
Segments are grouped into SpeakerBlocks when the speaker remains the same
and the gap between consecutive segments is less than the configured threshold.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| segments | List[DiarizedSegment] | A list of diarization segments (must be sorted by start time). | required |
| config | DiarizationConfig | Configuration containing the allowed gap between segments. | DiarizationConfig() |

Returns:

| Type | Description |
|---|---|
| List[SpeakerBlock] | A list of SpeakerBlock objects representing grouped speaker runs. |
Source code in src/tnh_scholar/audio_processing/diarization/strategies/speaker_blocker.py, lines 11–49.
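The grouping rule above (same speaker, and an inter-segment gap below the configured threshold) can be sketched over plain tuples; the real function consumes DiarizedSegment models and emits SpeakerBlock objects.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, speaker)

def group_blocks(segments: List[Segment], max_gap: float = 1.0) -> List[List[Segment]]:
    """Group start-sorted segments into runs of a single speaker, starting a
    new block whenever the speaker changes or the gap reaches max_gap."""
    blocks: List[List[Segment]] = []
    for seg in segments:
        if blocks:
            prev = blocks[-1][-1]
            same_speaker = seg[2] == prev[2]
            gap = seg[0] - prev[1]
            if same_speaker and gap < max_gap:
                blocks[-1].append(seg)
                continue
        blocks.append([seg])
    return blocks
```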
language_based
LanguageChunker – chunking informed by speaker blocks + language probing.
logger = get_child_logger(__name__)
module-attribute
LanguageChunker
Bases: ChunkingStrategy
Strategy:
- Group contiguous segments into SpeakerBlock objects.
- For each block longer than language_probe_threshold, probe language at configurable offsets; if the languages mismatch, split on the language change.
- Build chunks respecting target_time, similar to TimeGapChunker.
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_based.py, lines 23–98.
cfg = cfg
instance-attribute
detector = detector
instance-attribute
fetcher = fetcher
instance-attribute
lang_thresh = language_probe_threshold
instance-attribute
__init__(cfg=ChunkConfig(), fetcher=None, detector=None, language_probe_threshold=TimeMs(90000))
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_based.py, lines 33–43.
extract(segments)
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_based.py, lines 45–63.
language_probe
Lightweight language-detection helpers pluggable into chunkers.
logger = get_child_logger(__name__)
module-attribute
LanguageProbe
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_probe.py, lines 48–103.
detector = detector
instance-attribute
export_format = config.language.export_format
instance-attribute
probe_time = config.language.probe_time
instance-attribute
__init__(config, detector)
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_probe.py, lines 49–52.
segment_language(aug_segment)
Get the segment's ISO-639 language code from an Augmented Diarize Segment, which contains audio.
The probe window is always relative to the segment audio (0=start, duration=end).
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_probe.py, lines 54–75.
WhisperLanguageDetector
Language detector using Whisper service.
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_probe.py, lines 23–45.
audio_handler = audio_handler or AudioHandler()
instance-attribute
model = model
instance-attribute
__init__(model='whisper-1', audio_handler=None)
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_probe.py, lines 26–28.
detect(audio, format_str)
Source code in src/tnh_scholar/audio_processing/diarization/strategies/language_probe.py, lines 30–41.
speaker_blocker
group_speaker_blocks(segments, config=DiarizationConfig())
Group contiguous or near-contiguous segments by speaker identity.
Segments are grouped into SpeakerBlocks when the speaker remains the same
and the gap between consecutive segments is less than the configured threshold.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| segments | List[DiarizedSegment] | A list of diarization segments (must be sorted by start time). | required |
| config | DiarizationConfig | Configuration containing the allowed gap between segments. | DiarizationConfig() |

Returns:

| Type | Description |
|---|---|
| List[SpeakerBlock] | A list of SpeakerBlock objects representing grouped speaker runs. |
Source code in src/tnh_scholar/audio_processing/diarization/strategies/speaker_blocker.py, lines 11–49.
time_gap
TimeGapChunker – baseline strategy: split purely on accumulated time.
logger = get_child_logger(__name__)
module-attribute
TimeGapChunker
Bases: ChunkingStrategy
Chunker that ignores speaker/language and uses only time-gap logic.
Source code in src/tnh_scholar/audio_processing/diarization/strategies/time_gap.py, lines 21–69.
cfg = config
instance-attribute
__init__(config=DiarizationConfig())
Source code in src/tnh_scholar/audio_processing/diarization/strategies/time_gap.py, lines 24–25.
extract(segments)
Extract time-based chunks from diarization segments.
Source code in src/tnh_scholar/audio_processing/diarization/strategies/time_gap.py, lines 27–42.
timeline_mapper
Timeline mapping utilities for transforming timestamps from chunk-relative coordinates to original audio coordinates.
This module enables mapping transcript segments back to their original positions in the source audio after processing chunked audio.
logger = get_child_logger(__name__)
module-attribute
TimelineMapper
Maps timestamps from chunk-relative coordinates to original audio coordinates.
Source code in src/tnh_scholar/audio_processing/diarization/timeline_mapper.py, lines 35–258.
config = config or TimelineMapperConfig()
instance-attribute
__init__(config=None)
Initialize with optional configuration.
Source code in src/tnh_scholar/audio_processing/diarization/timeline_mapper.py, lines 38–40.
remap(timed_text, chunk)
Remap all timestamps in a TimedText object from chunk-relative to original audio coordinates.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| timed_text | TimedText | TimedText with chunk-relative timestamps | required |
| chunk | DiarizationChunk | DiarizationChunk containing mapping information | required |

Returns:

| Type | Description |
|---|---|
| TimedText | New TimedText object with remapped timestamps |
Source code in src/tnh_scholar/audio_processing/diarization/timeline_mapper.py, lines 42–64.
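Remapping chunk-relative timestamps into original-audio coordinates amounts to offsetting by the chunk's position in the source. A sketch over plain dicts; the field names are illustrative, while the real method works on TimedText and DiarizationChunk objects.

```python
from typing import Dict, List

def remap_units(units: List[Dict[str, float]], chunk_start_ms: float) -> List[Dict[str, float]]:
    """Shift chunk-relative start/end times into original-audio coordinates
    by adding the chunk's start offset; returns new dicts, as remap returns
    a new TimedText."""
    return [
        {**u,
         "start_ms": u["start_ms"] + chunk_start_ms,
         "end_ms": u["end_ms"] + chunk_start_ms}
        for u in units
    ]
```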
TimelineMapperConfig
Bases: BaseModel
Configuration options for timeline mapping.
Source code in src/tnh_scholar/audio_processing/diarization/timeline_mapper.py, lines 22–32.
debug_logging = Field(default=False, description='Enable detailed logging of mapping decisions')
class-attribute
instance-attribute
map_speakers = Field(default=True, description='Assign speaker to mapped timings using diarization segment speaker.')
class-attribute
instance-attribute
types
PyannoteEntry
Bases: TypedDict
Source code in src/tnh_scholar/audio_processing/diarization/types.py, lines 4–7.
end
instance-attribute
speaker
instance-attribute
start
instance-attribute
viewer
close_segment_viewer(pid)
Terminate the Streamlit viewer process by PID.
Source code in src/tnh_scholar/audio_processing/diarization/viewer.py, lines 44–50.
launch_segment_viewer(segments, master_audio_file)
Export segment data to a temporary JSON file and launch the Streamlit viewer.
Args:
- segments: List of dicts with diarization info (start, end, speaker).
- master_audio_file: Path to the master audio file.
Source code in src/tnh_scholar/audio_processing/diarization/viewer.py, lines 24–41.
load_segments_from_file(path)
Source code in src/tnh_scholar/audio_processing/diarization/viewer.py, lines 53–55.
main()
Source code in src/tnh_scholar/audio_processing/diarization/viewer.py, lines 57–166.
timed_object
__all__ = ['Granularity', 'TimedText', 'TimedTextUnit']
module-attribute
Granularity
Bases: str, Enum
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 19–21.
SEGMENT = 'segment'
class-attribute
instance-attribute
WORD = 'word'
class-attribute
instance-attribute
TimedText
Bases: BaseModel
Represents a collection of timed text units of a single granularity.
Only one of segments or words is populated, determined by granularity.
All units must match the declared granularity.
Notes
- Start times must be non-decreasing (overlaps allowed for multiple speakers).
- Negative start_ms or end_ms values are not allowed.
- Durations must be strictly positive (>0 ms).
- Mixed granularity is strictly prohibited.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 110–388.
duration
property
Get the total duration in milliseconds.
end_ms
property
Get the end time of the latest unit.
granularity = Field(..., description='Granularity type for all units.')
class-attribute
instance-attribute
segments = Field(default_factory=list, description='Phrase-level timed units')
class-attribute
instance-attribute
start_ms
property
Get the start time of the earliest unit.
units
property
Return the list of units matching the granularity.
words = Field(default_factory=list, description='Word-level timed units')
class-attribute
instance-attribute
__init__(*, granularity=None, segments=None, words=None, units=None, **kwargs)
Custom initializer for TimedText.
If units is provided, granularity is inferred from the first unit unless explicitly set.
If only segments or words is provided, granularity is set accordingly.
If all are empty, granularity must be provided.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 128–169.
__len__()
Return the number of units.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 234–236.
append(unit)
Add a unit to the end.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 238–243.
clear()
Remove all units.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 250–252.
export_text(separator='\n', skip_empty=True, show_speaker=True)
Export the text content of all units as a single string.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| separator | str | String used to separate units (default: newline). | '\n' |
| skip_empty | bool | If True, skip units with empty or whitespace-only text. | True |
| show_speaker | bool | If True, add speaker info. | True |

Returns:

| Type | Description |
|---|---|
| str | Concatenated text of all units, separated by the separator. |
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 367–388.
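A sketch of the export_text behavior over plain (speaker, text) pairs; the "speaker: text" prefix format is an assumption for illustration, not necessarily the library's output format.

```python
from typing import List, Optional, Tuple

def export_text(units: List[Tuple[Optional[str], str]], separator: str = "\n",
                skip_empty: bool = True, show_speaker: bool = True) -> str:
    """Join unit texts with separator, optionally skipping blank units and
    prefixing each line with its speaker label (assumed 'speaker: text')."""
    lines = []
    for speaker, text in units:
        if skip_empty and not text.strip():
            continue
        if show_speaker and speaker:
            lines.append(f"{speaker}: {text}")
        else:
            lines.append(text)
    return separator.join(lines)
```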
extend(units)
Add multiple units to the end.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 245–248.
filter_by_min_duration(min_duration_ms)
Return a new TimedText object containing only units with a minimum duration.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 311–319.
is_segment_granularity()
Return True if granularity is SEGMENT.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 211–213.
is_word_granularity()
Return True if granularity is WORD.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 215–217.
iter()
Unified iterator over the units of the correct granularity.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 339–343.
iter_segments()
Iterate over segment-level units.
Raises:

| Type | Description |
|---|---|
| ValueError | If granularity is not SEGMENT. |
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 345–354.
iter_words()
Iterate over word-level units.
Raises:

| Type | Description |
|---|---|
| ValueError | If granularity is not WORD. |
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 356–365.
merge(items)
classmethod
Merge a list of TimedText objects of the same granularity into a single TimedText object.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 321–337.
model_post_init(__context)
After initialization, sort units by start time and normalize durations.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 198–204.
set_all_speakers(speaker)
Set the same speaker for all units.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 260–263.
set_speaker(index, speaker)
Set speaker for a specific unit by index.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 254–258.
shift(offset_ms)
Shift all units by a given offset in milliseconds.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 265–268.
slice(start_ms, end_ms)
Return a new TimedText object containing only units within [start_ms, end_ms]. Units must overlap with the interval to be included.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 300–309.
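The documented slice rule keeps any unit that overlaps [start_ms, end_ms]. Over plain tuples, that is:

```python
from typing import List, Tuple

Unit = Tuple[float, float, str]  # (start_ms, end_ms, text)

def slice_units(units: List[Unit], start_ms: float, end_ms: float) -> List[Unit]:
    """Keep units whose interval overlaps the slice window; a unit entirely
    before or after the window is dropped."""
    return [u for u in units if u[0] < end_ms and u[1] > start_ms]
```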
sort_by_start()
Sort units by start time.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 270–272.
TimedTextUnit
Bases: BaseModel
Represents a timed unit with timestamps.
A fundamental building block for subtitle and transcript processing that associates text content with start/end times and optional metadata. Can represent either a segment (phrase/sentence) or a word.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 23–107.
confidence = Field(None, description='Optional confidence score')
class-attribute
instance-attribute
duration_ms
property
Get duration in milliseconds.
duration_sec
property
Get duration in seconds.
end_ms = Field(..., description='End time in milliseconds')
class-attribute
instance-attribute
end_sec
property
Get end time in seconds.
granularity
instance-attribute
index = Field(None, description='Entry index or sequence number')
class-attribute
instance-attribute
speaker = Field(None, description='Speaker identifier if available')
class-attribute
instance-attribute
start_ms = Field(..., description='Start time in milliseconds')
class-attribute
instance-attribute
start_sec
property
Get start time in seconds.
text = Field(..., description='The text content')
class-attribute
instance-attribute
normalize()
Normalize the duration of the segment to be nonzero.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 77–80.
overlaps_with(other)
Check if this unit overlaps with another.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 68–71.
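The overlap check for two timed units reduces to the standard interval test, sketched here for half-open intervals; treating the intervals as half-open is an assumption for illustration.

```python
def overlaps(a_start: float, a_end: float, b_start: float, b_end: float) -> bool:
    """Two half-open intervals [start, end) overlap iff each starts before
    the other ends."""
    return a_start < b_end and b_start < a_end
```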
set_speaker(speaker)
Set the speaker label.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 73–75.
shift_time(offset_ms)
Create a new TimedUnit with timestamps shifted by offset.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 59–66.
timed_text
Module for handling timed text objects; for example, subtitle formats such as VTT and SRT.
This module provides classes and utilities for parsing, manipulating, and generating timed text objects useful in subtitle and transcript processing. It uses Pydantic for robust data validation and type safety.
Granularity
Bases: str, Enum
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 19–21.
SEGMENT = 'segment'
class-attribute
instance-attribute
WORD = 'word'
class-attribute
instance-attribute
TimedText
Bases: BaseModel
Represents a collection of timed text units of a single granularity.
Only one of segments or words is populated, determined by granularity.
All units must match the declared granularity.
Notes
- Start times must be non-decreasing (overlaps allowed for multiple speakers).
- Negative start_ms or end_ms values are not allowed.
- Durations must be strictly positive (>0 ms).
- Mixed granularity is strictly prohibited.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 110–388.
duration
property
Get the total duration in milliseconds.
end_ms
property
Get the end time of the latest unit.
granularity = Field(..., description='Granularity type for all units.')
class-attribute
instance-attribute
segments = Field(default_factory=list, description='Phrase-level timed units')
class-attribute
instance-attribute
start_ms
property
Get the start time of the earliest unit.
units
property
Return the list of units matching the granularity.
words = Field(default_factory=list, description='Word-level timed units')
class-attribute
instance-attribute
__init__(*, granularity=None, segments=None, words=None, units=None, **kwargs)
Custom initializer for TimedText.
If units is provided, granularity is inferred from the first unit unless explicitly set.
If only segments or words is provided, granularity is set accordingly.
If all are empty, granularity must be provided.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 128–169.
__len__()
Return the number of units.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 234–236.
append(unit)
Add a unit to the end.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 238–243.
clear()
Remove all units.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 250–252.
export_text(separator='\n', skip_empty=True, show_speaker=True)
Export the text content of all units as a single string.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| separator | str | String used to separate units (default: newline). | '\n' |
| skip_empty | bool | If True, skip units with empty or whitespace-only text. | True |
| show_speaker | bool | If True, add speaker info. | True |

Returns:

| Type | Description |
|---|---|
| str | Concatenated text of all units, separated by the separator. |
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 367–388.
extend(units)
Add multiple units to the end.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 245–248.
filter_by_min_duration(min_duration_ms)
Return a new TimedText object containing only units with a minimum duration.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 311–319.
is_segment_granularity()
Return True if granularity is SEGMENT.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 211–213.
is_word_granularity()
Return True if granularity is WORD.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 215-217
iter()
Unified iterator over the units of the correct granularity.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 339-343
iter_segments()
Iterate over segment-level units.
Raises:

| Type | Description |
|---|---|
| ValueError | If granularity is not SEGMENT. |

Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 345-354
iter_words()
Iterate over word-level units.
Raises:

| Type | Description |
|---|---|
| ValueError | If granularity is not WORD. |

Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 356-365
merge(items)
classmethod
Merge a list of TimedText objects of the same granularity into a single TimedText object.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 321-337
model_post_init(__context)
After initialization, sort units by start time and normalize durations.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 198-204
set_all_speakers(speaker)
Set the same speaker for all units.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 260-263
set_speaker(index, speaker)
Set speaker for a specific unit by index.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 254-258
shift(offset_ms)
Shift all units by a given offset in milliseconds.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 265-268
slice(start_ms, end_ms)
Return a new TimedText object containing only units within [start_ms, end_ms]. Units must overlap with the interval to be included.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 300-309
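The inclusion rule used by slice (a unit is kept when it overlaps the interval) can be sketched standalone. This is an illustrative sketch, not the library code: units are modeled as bare (start_ms, end_ms) pairs and the function name slice_units is hypothetical.

```python
def slice_units(units, start_ms, end_ms):
    """Keep only units that overlap the interval [start_ms, end_ms].

    A unit (s, e) overlaps the interval when it starts before the
    interval ends AND ends after the interval starts.
    """
    return [(s, e) for (s, e) in units if s < end_ms and e > start_ms]


units = [(0, 500), (400, 900), (1200, 1500)]
print(slice_units(units, 450, 1000))  # the first two units overlap the interval
```

Note that a unit does not need to be fully contained in the interval; partial overlap is enough, matching the "units must overlap" wording above.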
sort_by_start()
Sort units by start time.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 270-272
TimedTextUnit
Bases: BaseModel
Represents a timed unit with timestamps.
A fundamental building block for subtitle and transcript processing that associates text content with start/end times and optional metadata. Can represent either a segment (phrase/sentence) or a word.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 23-107
confidence = Field(None, description='Optional confidence score')
class-attribute
instance-attribute
duration_ms
property
Get duration in milliseconds.
duration_sec
property
Get duration in seconds.
end_ms = Field(..., description='End time in milliseconds')
class-attribute
instance-attribute
end_sec
property
Get end time in seconds.
granularity
instance-attribute
index = Field(None, description='Entry index or sequence number')
class-attribute
instance-attribute
speaker = Field(None, description='Speaker identifier if available')
class-attribute
instance-attribute
start_ms = Field(..., description='Start time in milliseconds')
class-attribute
instance-attribute
start_sec
property
Get start time in seconds.
text = Field(..., description='The text content')
class-attribute
instance-attribute
normalize()
Normalize the duration of the segment to be nonzero.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 77-80
overlaps_with(other)
Check if this unit overlaps with another.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 68-71
set_speaker(speaker)
Set the speaker label.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 73-75
shift_time(offset_ms)
Create a new TimedTextUnit with timestamps shifted by offset.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 59-66
transcription
__all__ = ['patch_whisper_options', 'DiarizationChunker', 'TimedText', 'TextSegmentBuilder', 'TimedTextUnit', 'Granularity', 'TranscriptionService', 'TranscriptionServiceFactory']
module-attribute
DiarizationChunker
Class for chunking diarization results into processing units based on configurable duration targets.
Source code in src/tnh_scholar/audio_processing/diarization/chunker.py, lines 14-166
config = ChunkConfig()
instance-attribute
__init__(**config_options)
Initialize chunker with additional config_options.
Source code in src/tnh_scholar/audio_processing/diarization/chunker.py, lines 20-24
extract_contiguous_chunks(segments)
Split diarization segments into contiguous chunks of approximately target_duration, without splitting on speaker changes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| segments | List[DiarizedSegment] | List of speaker segments from diarization | required |

Returns:

| Type | Description |
|---|---|
| List[DiarizationChunk] | Flat list of contiguous chunks |

Source code in src/tnh_scholar/audio_processing/diarization/chunker.py, lines 27-42
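The chunking strategy described above, grouping contiguous segments toward a target duration without splitting any segment, can be sketched as follows. The function chunk_segments and the tuple-based segment model are illustrative assumptions, not the library implementation.

```python
def chunk_segments(segments, target_duration_ms):
    """Group contiguous (start_ms, end_ms) segments into chunks whose
    total span approximates target_duration_ms.

    Segments are never split: a chunk is closed as soon as its span
    reaches the target, so chunks may slightly overshoot it.
    """
    chunks, current = [], []
    for seg in segments:
        current.append(seg)
        span = current[-1][1] - current[0][0]  # chunk span so far
        if span >= target_duration_ms:
            chunks.append(current)
            current = []
    if current:  # flush any trailing partial chunk
        chunks.append(current)
    return chunks


segs = [(0, 4000), (4000, 9000), (9000, 12000), (12000, 13000)]
chunks = chunk_segments(segs, 8000)  # two chunks: [seg1, seg2] and [seg3, seg4]
```

The real chunker is also configurable via ChunkConfig; this sketch shows only the accumulate-then-flush shape of the algorithm.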
Granularity
Bases: str, Enum
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 19-21
SEGMENT = 'segment'
class-attribute
instance-attribute
WORD = 'word'
class-attribute
instance-attribute
TextSegmentBuilder
Source code in src/tnh_scholar/audio_processing/transcription/text_segment_builder.py, lines 23-164
avoid_orphans = avoid_orphans
instance-attribute
current_characters = 0
instance-attribute
current_words = []
instance-attribute
ignore_speaker = ignore_speaker
instance-attribute
max_duration = max_duration_ms
instance-attribute
max_gap_duration = max_gap_duration_ms
instance-attribute
segments = []
instance-attribute
target_characters = target_characters
instance-attribute
__init__(*, max_duration_ms=None, target_characters=None, avoid_orphans=True, max_gap_duration_ms=None, ignore_speaker=True)
Source code in src/tnh_scholar/audio_processing/transcription/text_segment_builder.py, lines 24-41
build_segments(*, target_duration=None, target_characters=None, avoid_orphans=True, max_gap_duration=None, ignore_speaker=False)
Build or rebuild segments from the contents of words.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| target_duration | Optional[int] | Maximum desired segment duration in milliseconds. | None |
| target_characters | Optional[int] | Maximum desired character length of a segment. | None |
| speaker_split |  | Whether to start a new segment when the speaker changes. | required |

Note
This is a stub. Concrete algorithms will be implemented later.

Raises:

| Type | Description |
|---|---|
| NotImplementedError | Always, until implemented. |

Source code in src/tnh_scholar/audio_processing/transcription/text_segment_builder.py, lines 141-164
create_segments(timed_text)
Source code in src/tnh_scholar/audio_processing/transcription/text_segment_builder.py, lines 43-61
TimedText
Bases: BaseModel
Represents a collection of timed text units of a single granularity.
Only one of segments or words is populated, determined by granularity.
All units must match the declared granularity.
Notes
- Start times must be non-decreasing (overlaps allowed for multiple speakers).
- Negative start_ms or end_ms values are not allowed.
- Durations must be strictly positive (>0 ms).
- Mixed granularity is strictly prohibited.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 110-388
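The invariants listed in the Notes above can be expressed as a small standalone check. This is a sketch under the stated rules, not the model's validator: units are modeled as (start_ms, end_ms) pairs and check_timed_units is a hypothetical name.

```python
def check_timed_units(units):
    """Validate the TimedText invariants for a list of (start_ms, end_ms) pairs:

    - no negative timestamps
    - strictly positive durations (> 0 ms)
    - non-decreasing start times (overlaps are allowed, e.g. multiple speakers)
    """
    prev_start = None
    for start, end in units:
        if start < 0 or end < 0:
            raise ValueError("negative start_ms/end_ms values are not allowed")
        if end - start <= 0:
            raise ValueError("durations must be strictly positive")
        if prev_start is not None and start < prev_start:
            raise ValueError("start times must be non-decreasing")
        prev_start = start


check_timed_units([(0, 100), (50, 200)])  # overlapping units are fine
```

In the real class, model_post_init sorts units by start time and normalizes durations, so the non-decreasing invariant is established automatically.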
duration
property
Get the total duration in milliseconds.
end_ms
property
Get the end time of the latest unit.
granularity = Field(..., description='Granularity type for all units.')
class-attribute
instance-attribute
segments = Field(default_factory=list, description='Phrase-level timed units')
class-attribute
instance-attribute
start_ms
property
Get the start time of the earliest unit.
units
property
Return the list of units matching the granularity.
words = Field(default_factory=list, description='Word-level timed units')
class-attribute
instance-attribute
__init__(*, granularity=None, segments=None, words=None, units=None, **kwargs)
Custom initializer for TimedText.
If units is provided, granularity is inferred from the first unit unless explicitly set.
If only segments or words is provided, granularity is set accordingly.
If all are empty, granularity must be provided.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 128-169
__len__()
Return the number of units.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 234-236
append(unit)
Add a unit to the end.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 238-243
clear()
Remove all units.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 250-252
export_text(separator='\n', skip_empty=True, show_speaker=True)
Export the text content of all units as a single string.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| separator | str | String used to separate units. | '\n' |
| skip_empty | bool | If True, skip units with empty or whitespace-only text. | True |
| show_speaker |  | If True, add speaker info. | True |

Returns:

| Type | Description |
|---|---|
| str | Concatenated text of all units, separated by separator. |

Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 367-388
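The joining behavior documented above can be sketched standalone. This is an illustrative sketch, not the library code: units are modeled as (speaker, text) pairs and the speaker prefix format is an assumption.

```python
def export_text(units, separator="\n", skip_empty=True, show_speaker=True):
    """Join the text of (speaker, text) units into one string.

    Whitespace-only texts are dropped when skip_empty is True; the
    speaker is prefixed when show_speaker is True and a speaker is set.
    """
    lines = []
    for speaker, text in units:
        if skip_empty and not text.strip():
            continue  # skip empty/whitespace-only units
        lines.append(f"{speaker}: {text}" if show_speaker and speaker else text)
    return separator.join(lines)


units = [("A", "Hello."), ("B", "   "), ("A", "Welcome.")]
print(export_text(units))  # "A: Hello." and "A: Welcome." on separate lines
```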
extend(units)
Add multiple units to the end.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 245-248
filter_by_min_duration(min_duration_ms)
Return a new TimedText object containing only units with a minimum duration.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 311-319
is_segment_granularity()
Return True if granularity is SEGMENT.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 211-213
is_word_granularity()
Return True if granularity is WORD.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 215-217
iter()
Unified iterator over the units of the correct granularity.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 339-343
iter_segments()
Iterate over segment-level units.
Raises:

| Type | Description |
|---|---|
| ValueError | If granularity is not SEGMENT. |

Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 345-354
iter_words()
Iterate over word-level units.
Raises:

| Type | Description |
|---|---|
| ValueError | If granularity is not WORD. |

Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 356-365
merge(items)
classmethod
Merge a list of TimedText objects of the same granularity into a single TimedText object.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 321-337
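The merge operation, combining several same-granularity collections into one re-sorted collection, can be sketched standalone. Units are modeled as (start_ms, end_ms) pairs and merge_unit_lists is a hypothetical name; the real classmethod also enforces matching granularity.

```python
def merge_unit_lists(lists):
    """Flatten several lists of (start_ms, end_ms) units into one list,
    re-sorted by start time (mirroring TimedText's sort-on-init behavior)."""
    merged = [u for lst in lists for u in lst]
    merged.sort(key=lambda u: u[0])
    return merged


a = [(0, 100), (200, 300)]
b = [(150, 180)]
print(merge_unit_lists([a, b]))  # [(0, 100), (150, 180), (200, 300)]
```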
model_post_init(__context)
After initialization, sort units by start time and normalize durations.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 198-204
set_all_speakers(speaker)
Set the same speaker for all units.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 260-263
set_speaker(index, speaker)
Set speaker for a specific unit by index.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 254-258
shift(offset_ms)
Shift all units by a given offset in milliseconds.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 265-268
slice(start_ms, end_ms)
Return a new TimedText object containing only units within [start_ms, end_ms]. Units must overlap with the interval to be included.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 300-309
sort_by_start()
Sort units by start time.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 270-272
TimedTextUnit
Bases: BaseModel
Represents a timed unit with timestamps.
A fundamental building block for subtitle and transcript processing that associates text content with start/end times and optional metadata. Can represent either a segment (phrase/sentence) or a word.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 23-107
confidence = Field(None, description='Optional confidence score')
class-attribute
instance-attribute
duration_ms
property
Get duration in milliseconds.
duration_sec
property
Get duration in seconds.
end_ms = Field(..., description='End time in milliseconds')
class-attribute
instance-attribute
end_sec
property
Get end time in seconds.
granularity
instance-attribute
index = Field(None, description='Entry index or sequence number')
class-attribute
instance-attribute
speaker = Field(None, description='Speaker identifier if available')
class-attribute
instance-attribute
start_ms = Field(..., description='Start time in milliseconds')
class-attribute
instance-attribute
start_sec
property
Get start time in seconds.
text = Field(..., description='The text content')
class-attribute
instance-attribute
normalize()
Normalize the duration of the segment to be nonzero.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 77-80
overlaps_with(other)
Check if this unit overlaps with another.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 68-71
set_speaker(speaker)
Set the speaker label.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 73-75
shift_time(offset_ms)
Create a new TimedTextUnit with timestamps shifted by offset.
Source code in src/tnh_scholar/audio_processing/timed_object/timed_text.py, lines 59-66
TranscriptionService
Bases: ABC
Abstract base class defining the interface for transcription services.
This interface provides a standard way to interact with different transcription service providers (e.g., OpenAI Whisper, AssemblyAI).
Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py, lines 37-98
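The interface pattern described above can be sketched as a reduced abstract base class. This is a sketch based on the documented method names, not the library source; only two of the three abstract methods are shown, and EchoService is a toy provider invented here to illustrate subclassing.

```python
from abc import ABC, abstractmethod
from io import BytesIO
from pathlib import Path
from typing import Any, Dict, Optional, Union


class TranscriptionService(ABC):
    """Reduced sketch of the documented transcription interface."""

    @abstractmethod
    def transcribe(self, audio_file: Union[Path, BytesIO],
                   options: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        """Transcribe an audio file and return a standardized result."""

    @abstractmethod
    def get_result(self, job_id: str) -> Dict[str, Any]:
        """Fetch the result of an existing transcription job."""


class EchoService(TranscriptionService):
    # Toy provider: shows the subclassing pattern, does no real work.
    def transcribe(self, audio_file, options=None):
        return {"text": f"transcribed {audio_file}"}

    def get_result(self, job_id):
        return {"job_id": job_id}
```

Concrete providers such as the AssemblyAI implementation below fill these methods in with real API calls while keeping the same call signatures.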
get_result(job_id)
abstractmethod
Get results for an existing transcription job.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| job_id | str | ID of the transcription job | required |

Returns:

| Type | Description |
|---|---|
| TranscriptionResult | Dictionary containing transcription results in the same standardized format as transcribe() |

Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py, lines 64-76
transcribe(audio_file, options=None)
abstractmethod
Transcribe audio file to text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Union[Path, BytesIO] | Path to audio file or file-like object | required |
| options | Optional[Dict[str, Any]] | Provider-specific options for transcription | None |

Returns:

| Type | Description |
|---|---|
| TranscriptionResult | TranscriptionResult |

Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py, lines 45-62
transcribe_to_format(audio_file, format_type='srt', transcription_options=None, format_options=None)
abstractmethod
Transcribe audio and return result in specified format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Union[Path, BytesIO] | Path, file-like object, or URL of audio file | required |
| format_type | str | Format type (e.g., "srt", "vtt", "text") | 'srt' |
| transcription_options | Optional[Dict[str, Any]] | Options for transcription | None |
| format_options | Optional[Dict[str, Any]] | Format-specific options | None |

Returns:

| Type | Description |
|---|---|
| str | String representation in the requested format |

Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py, lines 78-98
TranscriptionServiceFactory
Factory for creating transcription service instances.
This factory provides a standard way to create transcription service instances based on the provider name and configuration.
Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py, lines 101-174
create_service(provider='assemblyai', api_key=None, **kwargs)
classmethod
Create a transcription service instance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| provider | str | Service provider name (e.g., "whisper", "assemblyai") | 'assemblyai' |
| api_key | Optional[str] | API key for the service | None |
| **kwargs |  | Additional provider-specific configuration | {} |

Returns:

| Type | Description |
|---|---|
| TranscriptionService | TranscriptionService instance |

Raises:

| Type | Description |
|---|---|
| ValueError | If the provider is not supported |
| ImportError | If the provider module cannot be imported |

Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py, lines 132-174
register_provider(name, provider_class)
classmethod
Register a provider implementation with the factory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Provider name (lowercase) | required |
| provider_class | Callable[..., TranscriptionService] | Provider implementation class or factory function | required |

Example

    from my_module import MyTranscriptionService
    TranscriptionServiceFactory.register_provider("my_provider", MyTranscriptionService)

Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py, lines 113-130
patch_whisper_options(options, file_extension)
Patch routine to ensure 'file_extension' is present in transcription options dict. This is a workaround for OpenAI Whisper API, which requires file-like objects to have a filename/extension. Only allows known audio extensions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| options | Optional[Dict[str, Any]] | Transcription options dictionary (will not be mutated) | required |
| file_extension | str | File extension string (with or without leading dot) | required |

Returns:

| Type | Description |
|---|---|
| Dict[str, Any] | New options dictionary with 'file_extension' set appropriately |

Raises:

| Type | Description |
|---|---|
| ValueError | If file_extension is not in the allowed list |

Source code in src/tnh_scholar/audio_processing/transcription/patches.py, lines 17-43
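The documented contract (copy the dict, normalize the extension, validate it, set 'file_extension') can be sketched as follows. This is an illustrative sketch, not the library source; in particular the ALLOWED_EXTENSIONS set is an assumption for the example, not the library's actual allow-list.

```python
from typing import Any, Dict, Optional

# Assumed allow-list for illustration only; the real list lives in patches.py.
ALLOWED_EXTENSIONS = {"mp3", "mp4", "wav", "m4a", "webm", "ogg", "flac"}


def patch_whisper_options(options: Optional[Dict[str, Any]],
                          file_extension: str) -> Dict[str, Any]:
    """Return a copy of `options` with 'file_extension' set.

    The input dict is never mutated; the extension is accepted with or
    without a leading dot, and unknown extensions raise ValueError.
    """
    ext = file_extension.lstrip(".").lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported audio extension: {file_extension!r}")
    patched = dict(options or {})  # shallow copy; original untouched
    patched["file_extension"] = ext
    return patched
```

Copy-then-set keeps the caller's options dict reusable across multiple transcription calls with different file types.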
assemblyai_service
AssemblyAI implementation of the TranscriptionService interface.
This module provides a complete implementation of the TranscriptionService interface using the AssemblyAI Python SDK, with support for all major features including:
- Transcription with configurable options
- Speaker diarization
- Automatic language detection
- Audio intelligence features
- Subtitle generation
- Regional endpoint support
- Webhook callbacks
The implementation follows a modular design with single-action methods and supports both synchronous and asynchronous usage patterns.
logger = get_child_logger(__name__)
module-attribute
AAIConfig
dataclass
Comprehensive configuration for AssemblyAI transcription service.
This class contains all configurable options for the AssemblyAI API, organized by feature category.
Source code in src/tnh_scholar/audio_processing/transcription/assemblyai_service.py, lines 47-99
api_key = None
class-attribute
instance-attribute
auto_chapters = False
class-attribute
instance-attribute
auto_highlights = False
class-attribute
instance-attribute
chars_per_caption = 60
class-attribute
instance-attribute
content_safety = False
class-attribute
instance-attribute
custom_spelling = field(default_factory=dict)
class-attribute
instance-attribute
disfluencies = False
class-attribute
instance-attribute
dual_channel = False
class-attribute
instance-attribute
entity_detection = False
class-attribute
instance-attribute
filter_profanity = False
class-attribute
instance-attribute
format_text = True
class-attribute
instance-attribute
iab_categories = False
class-attribute
instance-attribute
language_code = None
class-attribute
instance-attribute
language_detection = True
class-attribute
instance-attribute
polling_interval = 4
class-attribute
instance-attribute
punctuate = True
class-attribute
instance-attribute
sentiment_analysis = False
class-attribute
instance-attribute
speaker_labels = True
class-attribute
instance-attribute
speakers_expected = None
class-attribute
instance-attribute
speech_model = SpeechModel.BEST
class-attribute
instance-attribute
summarization = False
class-attribute
instance-attribute
use_eu_endpoint = False
class-attribute
instance-attribute
webhook_auth_header_name = None
class-attribute
instance-attribute
webhook_auth_header_value = None
class-attribute
instance-attribute
webhook_url = None
class-attribute
instance-attribute
word_boost = field(default_factory=list)
class-attribute
instance-attribute
__init__(api_key=None, use_eu_endpoint=False, polling_interval=4, speech_model=SpeechModel.BEST, language_code=None, language_detection=True, dual_channel=False, format_text=True, punctuate=True, disfluencies=False, filter_profanity=False, chars_per_caption=60, speaker_labels=True, speakers_expected=None, custom_spelling=dict(), word_boost=list(), auto_chapters=False, auto_highlights=False, entity_detection=False, iab_categories=False, sentiment_analysis=False, summarization=False, content_safety=False, webhook_url=None, webhook_auth_header_name=None, webhook_auth_header_value=None)
AAITranscriptionService
Bases: TranscriptionService
AssemblyAI implementation of the TranscriptionService interface.
Provides comprehensive access to AssemblyAI's transcription services with support for all major features through the official Python SDK.
Source code in src/tnh_scholar/audio_processing/transcription/assemblyai_service.py, lines 102-636
config = AAIConfig()
instance-attribute
format_converter = FormatConverter()
instance-attribute
transcriber = aai.Transcriber(config=(self._create_transcription_config(options)))
instance-attribute
__init__(api_key=None, options=None)
Initialize the AssemblyAI transcription service.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | Optional[str] | AssemblyAI API key (defaults to ASSEMBLYAI_API_KEY env var) | None |
| config |  | Comprehensive configuration options | required |

Source code in src/tnh_scholar/audio_processing/transcription/assemblyai_service.py, lines 110-136
get_result(job_id)
Get results for an existing transcription job.
This method blocks until the transcript is retrieved.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| job_id | str | ID of the transcription job | required |

Returns:

| Type | Description |
|---|---|
| TranscriptionResult | Dictionary containing transcription results |

Source code in src/tnh_scholar/audio_processing/transcription/assemblyai_service.py, lines 518-540
get_subtitles(transcript_id, format_type='srt')
Get subtitles directly from AssemblyAI.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| transcript_id | str | ID of the transcription job | required |
| format_type | str | Format type ("srt" or "vtt") | 'srt' |
| chars_per_caption |  | Maximum characters per caption | required |

Returns:

| Type | Description |
|---|---|
| str | String representation in the requested format |

Raises:

| Type | Description |
|---|---|
| ValueError | If the format type is not supported |

Source code in src/tnh_scholar/audio_processing/transcription/assemblyai_service.py, lines 542-577
standardize_result(transcript)
Standardize AssemblyAI transcript to match common format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| transcript | Transcript | AssemblyAI transcript object | required |

Returns:

| Type | Description |
|---|---|
| TranscriptionResult | Standardized result dictionary |

Source code in src/tnh_scholar/audio_processing/transcription/assemblyai_service.py, lines 419-446
transcribe(audio_file, options=None)
Transcribe audio file to text using AssemblyAI's synchronous SDK approach.
This method handles:

- File paths
- File-like objects
- URLs

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Union[Path, BinaryIO, str] | Path, file-like object, or URL of audio file | required |
| options | Optional[Dict[str, Any]] | Provider-specific options for transcription | None |

Returns:

| Type | Description |
|---|---|
| TranscriptionResult | Dictionary containing standardized transcription results |

Source code in src/tnh_scholar/audio_processing/transcription/assemblyai_service.py, lines 448-476
transcribe_async(audio_file, options=None)
Submit an asynchronous transcription job using AssemblyAI's SDK.
This method submits a transcription job and returns immediately with a transcript ID that can be used to retrieve results later.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
audio_file
|
Union[Path, BinaryIO, str]
|
Path, file-like object, or URL of audio file |
required |
options
|
Optional[Dict[str, Any]]
|
Provider-specific options for transcription |
None
|
Returns:
| Type | Description |
|---|---|
Future
|
String containing the transcript ID for later retrieval |
Notes
The SDK's submit method returns a Future object, but this method extracts just the transcript ID for simpler handling.
Source code in src/tnh_scholar/audio_processing/transcription/assemblyai_service.py
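Because transcribe_async() hands back only a transcript ID, callers typically poll get_result() until the job leaves its pending state. A sketch of that submit-then-poll pattern, using an in-memory stand-in for the service (the _jobs dict and status values are illustrative assumptions):

```python
import time
from typing import Callable, Dict

def poll_transcript(get_result: Callable[[str], Dict], transcript_id: str,
                    interval_s: float = 0.0, max_attempts: int = 10) -> Dict:
    """Poll get_result(transcript_id) until the job is no longer pending."""
    for _ in range(max_attempts):
        result = get_result(transcript_id)
        if result.get("status") not in ("queued", "processing"):
            return result
        time.sleep(interval_s)
    raise TimeoutError(f"Transcript {transcript_id} still processing")

# Hypothetical in-memory service: first call pending, second call done.
_jobs = {"tx_1": iter([{"status": "processing"},
                       {"status": "completed", "text": "hello"}])}
result = poll_transcript(lambda tid: next(_jobs[tid]), "tx_1")
```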
transcribe_to_format(audio_file, format_type='srt', transcription_options=None, format_options=None)
Transcribe audio and return result in specified format.
Takes advantage of the direct subtitle generation functionality when requesting SRT or VTT formats.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Union[Path, BinaryIO, str] | Path, file-like object, or URL of audio file | required |
| format_type | str | Format type (e.g., "srt", "vtt", "text") | 'srt' |
| transcription_options | Optional[Dict[str, Any]] | Options for transcription | None |
| format_options | Optional[Dict[str, Any]] | Format-specific options | None |

Returns:

| Type | Description |
|---|---|
| str | String representation in the requested format |
Source code in src/tnh_scholar/audio_processing/transcription/assemblyai_service.py
SpeechModel
Bases: str, Enum
Supported AssemblyAI speech models.
Source code in src/tnh_scholar/audio_processing/transcription/assemblyai_service.py
BEST = 'best'
class-attribute
instance-attribute
NANO = 'nano'
class-attribute
instance-attribute
format_converter
tnh_scholar.audio_processing.transcription.format_converter
Thin facade that turns raw transcription-service output dictionaries into the formats requested by callers (plain text and SRT; VTT coming later).
Core heavy lifting now lives in:
- TimedText / TimedTextUnit - canonical internal representation
- SegmentBuilder - word-level -> sentence/segment chunking
- SRTProcessor - rendering to .srt
Only one public method remains: FormatConverter.convert.
logger = get_child_logger(__name__)
module-attribute
FormatConverter
Convert a raw transcription result to text, SRT, or (placeholder) VTT.
The raw result must follow the loose schema:

- {"utterances": [...]} -> already speaker-segmented
- {"words": [...]} -> word-level; we chunk via SegmentBuilder
- {"text": "...", "audio_duration_ms": 12345} -> single-blob fallback
Source code in src/tnh_scholar/audio_processing/transcription/format_converter.py
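The three loose-schema variants above can be told apart by simple key inspection. A sketch of that dispatch — detect_schema is illustrative, not the converter's actual internals:

```python
from typing import Any, Dict

def detect_schema(result: Dict[str, Any]) -> str:
    """Identify which loose-schema variant a raw transcription result follows."""
    if "utterances" in result:
        return "utterances"   # already speaker-segmented
    if "words" in result:
        return "words"        # word-level; chunk via SegmentBuilder
    if "text" in result:
        return "text"         # single-blob fallback
    raise ValueError("Unrecognized transcription result schema")
```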
config = config or FormatConverterConfig()
instance-attribute
__init__(config=None)
Source code in src/tnh_scholar/audio_processing/transcription/format_converter.py
convert(result, format_type='srt', format_options=None)
Convert result to the given format_type.
Parameters
result : dict
Raw transcription output.
format_type : {"srt", "text", "vtt"}
format_options : dict | None
Currently only {"include_speaker": bool} is recognized, for the "srt" format.
Source code in src/tnh_scholar/audio_processing/transcription/format_converter.py
FormatConverterConfig
Bases: BaseModel
User-tunable knobs for FormatConverter.
Only a handful remain now that the heavy logic moved to SegmentBuilder.
Source code in src/tnh_scholar/audio_processing/transcription/format_converter.py
characters_per_entry = 42
class-attribute
instance-attribute
include_segment_index = True
class-attribute
instance-attribute
include_speaker = True
class-attribute
instance-attribute
max_entry_duration_ms = 6000
class-attribute
instance-attribute
max_gap_duration_ms = 2000
class-attribute
instance-attribute
patches
patch_file_with_name(file_obj, extension)
Ensures the file-like object has a .name attribute with the correct extension.
Source code in src/tnh_scholar/audio_processing/transcription/patches.py
patch_whisper_options(options, file_extension)
Patch routine to ensure 'file_extension' is present in transcription options dict. This is a workaround for OpenAI Whisper API, which requires file-like objects to have a filename/extension. Only allows known audio extensions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| options | Optional[Dict[str, Any]] | Transcription options dictionary (will not be mutated) | required |
| file_extension | str | File extension string (with or without leading dot) | required |

Returns:

| Type | Description |
|---|---|
| Dict[str, Any] | New options dictionary with 'file_extension' set appropriately |

Raises:

| Type | Description |
|---|---|
| ValueError | If file_extension is not in the allowed list |
Source code in src/tnh_scholar/audio_processing/transcription/patches.py
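The non-mutating patch behavior can be sketched as follows. patch_options and ALLOWED_EXTENSIONS are illustrative assumptions; the real allowed-extension list lives in the package:

```python
from typing import Any, Dict, Optional

# Assumed set of common audio extensions; the package defines its own list.
ALLOWED_EXTENSIONS = {"mp3", "mp4", "wav", "m4a", "flac", "ogg", "webm"}

def patch_options(options: Optional[Dict[str, Any]], file_extension: str) -> Dict[str, Any]:
    """Return a new options dict with a validated 'file_extension' key."""
    ext = file_extension.lstrip(".").lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported audio extension: {file_extension!r}")
    patched = dict(options or {})   # copy: the caller's dict is never mutated
    patched["file_extension"] = ext
    return patched
```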
srt_processor
SRTConfig
Configuration options for SRT processing.
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
include_speaker = include_speaker
instance-attribute
max_chars_per_line = max_chars_per_line
instance-attribute
reindex_entries = reindex_entries
instance-attribute
speaker_format = speaker_format
instance-attribute
timestamp_format = timestamp_format
instance-attribute
use_pysrt = use_pysrt
instance-attribute
__init__(include_speaker=False, speaker_format='[{speaker}] {text}', reindex_entries=True, timestamp_format='{:02d}:{:02d}:{:02d},{:03d}', max_chars_per_line=42, use_pysrt=False)
Initialize with default settings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| include_speaker | | Whether to include speaker labels in output | False |
| speaker_format | | Format string for speaker attribution | '[{speaker}] {text}' |
| reindex_entries | | Whether to reindex entries sequentially | True |
| timestamp_format | | Format string for timestamp formatting | '{:02d}:{:02d}:{:02d},{:03d}' |
| max_chars_per_line | | Maximum characters per line before splitting | 42 |
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
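The default timestamp_format '{:02d}:{:02d}:{:02d},{:03d}' takes hours, minutes, seconds, and milliseconds. A sketch of rendering a millisecond offset with it (format_srt_timestamp is illustrative, not the processor's actual helper):

```python
def format_srt_timestamp(ms: int, fmt: str = "{:02d}:{:02d}:{:02d},{:03d}") -> str:
    """Render a millisecond offset as an SRT timestamp (comma before millis).

    WebVTT uses the same fields with a dot: '{:02d}:{:02d}:{:02d}.{:03d}'.
    """
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return fmt.format(hours, minutes, seconds, millis)
```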
SRTProcessor
Handles parsing and generating SRT format.
Provides functionality to convert between SRT text format and TimedText objects, with various formatting options. Supports both native parsing/generation and pysrt backend.
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
config = config or SRTConfig()
instance-attribute
__init__(config=None)
Initialize with optional configuration overrides.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | Optional[SRTConfig] | Configuration options for SRT processing | None |
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
add_speaker_labels(srt_content, *, speaker=None, speaker_labels=None)
Unified entry point for adding speaker labels. (Not implemented yet.)
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
assign_single_speaker(srt_content, speaker)
Assign the same speaker to all segments in the SRT content.
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
assign_speaker_by_mapping(srt_content, speaker_labels)
Assign speakers to segments based on a mapping of speaker to segment indices. (Not implemented yet.)
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
combine(timed_texts)
Combine multiple lists of TimedText into one, with proper indexing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| timed_texts | | List of TimedText lists to combine | required |

Returns:

| Type | Description |
|---|---|
| TimedText | Combined list of TimedText objects |
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
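The reindexing step of combine() can be sketched with plain (index, text) tuples. This is illustrative only; the real method operates on TimedText objects:

```python
from typing import List, Tuple

def combine_entries(lists: List[List[Tuple[int, str]]]) -> List[Tuple[int, str]]:
    """Concatenate (index, text) entry lists, reindexing sequentially from 1."""
    merged = [entry for sub in lists for entry in sub]
    return [(i, text) for i, (_, text) in enumerate(merged, start=1)]
```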
generate(timed_text, include_speaker=None)
Generate SRT content from a TimedText object. Uses internal generator or pysrt depending on configuration.
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
merge_srts(srt_list)
Merge multiple SRT files into a single SRT string.
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
parse(srt_content)
Parse SRT content into a new TimedText object. Uses internal parser or pysrt depending on configuration.
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
shift_timestamps(timed_text, offset_ms)
Shift all timestamps by the given offset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| timed_text | | List of TimedText objects | required |
| offset_ms | int | Offset in milliseconds to apply | required |

Returns:

| Type | Description |
|---|---|
| TimedText | New list of TimedText objects with adjusted timestamps |
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
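shift_timestamps() returns a new list rather than mutating its input. A sketch of that behavior with a stand-in Cue dataclass (Cue is hypothetical; the real code uses TimedText units):

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class Cue:
    """Stand-in for a timed text unit; illustration only."""
    start_ms: int
    end_ms: int
    text: str

def shift_cues(cues: List[Cue], offset_ms: int) -> List[Cue]:
    """Return new cues with start/end shifted; the input list is untouched."""
    return [replace(c, start_ms=c.start_ms + offset_ms, end_ms=c.end_ms + offset_ms)
            for c in cues]
```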
SubtitleFormat
Bases: str, Enum
Supported subtitle formats.
Source code in src/tnh_scholar/audio_processing/transcription/srt_processor.py
SRT = 'srt'
class-attribute
instance-attribute
TEXT = 'text'
class-attribute
instance-attribute
VTT = 'vtt'
class-attribute
instance-attribute
text_segment_builder
SegmentBuilder for creating phrase-level segments from word-level TimedText.
This module builds higher-level segments from a TimedText object containing word-level units, based on configurable criteria like duration, character count, punctuation, pauses, and speaker changes.
COMMON_ABBREVIATIONS = frozenset({'adj.', 'adm.', 'adv.', 'al.', 'anon.', 'apr.', 'arc.', 'aug.', 'ave.', 'brig.', 'bros.', 'capt.', 'cmdr.', 'col.', 'comdr.', 'con.', 'corp.', 'cpl.', 'dr.', 'drs.', 'ed.', 'enc.', 'etc.', 'ex.', 'feb.', 'gen.', 'gov.', 'hon.', 'hosp.', 'hr.', 'inc.', 'jan.', 'jr.', 'maj.', 'mar.', 'messrs.', 'mlle.', 'mm.', 'mme.', 'mr.', 'mrs.', 'ms.', 'msgr.', 'nov.', 'oct.', 'op.', 'ord.', 'ph.d.', 'prof.', 'pvt.', 'rep.', 'reps.', 'res.', 'rev.', 'rt.', 'sen.', 'sens.', 'sep.', 'sfc.', 'sgt.', 'sr.', 'st.', 'supt.', 'surg.', 'u.s.', 'v.p.', 'vs.'})
module-attribute
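The greedy chunking by character count described above can be sketched as follows. chunk_words is illustrative; the real builder also weighs duration, pauses, punctuation, and speaker changes:

```python
from typing import List

def chunk_words(words: List[str], target_characters: int = 42) -> List[str]:
    """Greedily group words into segments no longer than target_characters.

    A single over-long word still gets its own segment.
    """
    segments: List[str] = []
    current: List[str] = []
    length = 0
    for word in words:
        added = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and length + added > target_characters:
            segments.append(" ".join(current))
            current, length = [], 0
            added = len(word)
        current.append(word)
        length += added
    if current:
        segments.append(" ".join(current))
    return segments
```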
TextSegmentBuilder
Source code in src/tnh_scholar/audio_processing/transcription/text_segment_builder.py
avoid_orphans = avoid_orphans
instance-attribute
current_characters = 0
instance-attribute
current_words = []
instance-attribute
ignore_speaker = ignore_speaker
instance-attribute
max_duration = max_duration_ms
instance-attribute
max_gap_duration = max_gap_duration_ms
instance-attribute
segments = []
instance-attribute
target_characters = target_characters
instance-attribute
__init__(*, max_duration_ms=None, target_characters=None, avoid_orphans=True, max_gap_duration_ms=None, ignore_speaker=True)
Source code in src/tnh_scholar/audio_processing/transcription/text_segment_builder.py
build_segments(*, target_duration=None, target_characters=None, avoid_orphans=True, max_gap_duration=None, ignore_speaker=False)
Build or rebuild segments from the contents of words.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| target_duration | Optional[int] | Maximum desired segment duration in milliseconds. | None |
| target_characters | Optional[int] | Maximum desired character length of a segment. | None |
| speaker_split | | Whether to start a new segment when the speaker changes. | required |

Note: This is a stub. Concrete algorithms will be implemented later.

Raises:

| Type | Description |
|---|---|
| NotImplementedError | Always, until implemented. |
Source code in src/tnh_scholar/audio_processing/transcription/text_segment_builder.py
create_segments(timed_text)
Source code in src/tnh_scholar/audio_processing/transcription/text_segment_builder.py
transcription_service
TranscriptionResult
Bases: BaseModel
Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py
audio_duration_ms = None
class-attribute
instance-attribute
confidence = None
class-attribute
instance-attribute
language
instance-attribute
raw_result = None
class-attribute
instance-attribute
status = None
class-attribute
instance-attribute
text
instance-attribute
transcript_id = None
class-attribute
instance-attribute
utterance_timing = None
class-attribute
instance-attribute
word_timing = None
class-attribute
instance-attribute
TranscriptionService
Bases: ABC
Abstract base class defining the interface for transcription services.
This interface provides a standard way to interact with different transcription service providers (e.g., OpenAI Whisper, AssemblyAI).
Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py
get_result(job_id)
abstractmethod
Get results for an existing transcription job.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| job_id | str | ID of the transcription job | required |

Returns:

| Type | Description |
|---|---|
| TranscriptionResult | Dictionary containing transcription results in the same standardized format as transcribe() |
Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py
transcribe(audio_file, options=None)
abstractmethod
Transcribe audio file to text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Union[Path, BytesIO] | Path to audio file or file-like object | required |
| options | Optional[Dict[str, Any]] | Provider-specific options for transcription | None |

Returns:

| Type | Description |
|---|---|
| TranscriptionResult | TranscriptionResult |
Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py
transcribe_to_format(audio_file, format_type='srt', transcription_options=None, format_options=None)
abstractmethod
Transcribe audio and return result in specified format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Union[Path, BytesIO] | Path, file-like object, or URL of audio file | required |
| format_type | str | Format type (e.g., "srt", "vtt", "text") | 'srt' |
| transcription_options | Optional[Dict[str, Any]] | Options for transcription | None |
| format_options | Optional[Dict[str, Any]] | Format-specific options | None |

Returns:

| Type | Description |
|---|---|
| str | String representation in the requested format |
Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py
TranscriptionServiceFactory
Factory for creating transcription service instances.
This factory provides a standard way to create transcription service instances based on the provider name and configuration.
Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py
create_service(provider='assemblyai', api_key=None, **kwargs)
classmethod
Create a transcription service instance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| provider | str | Service provider name (e.g., "whisper", "assemblyai") | 'assemblyai' |
| api_key | Optional[str] | API key for the service | None |
| **kwargs | | Additional provider-specific configuration | {} |

Returns:

| Type | Description |
|---|---|
| TranscriptionService | TranscriptionService instance |

Raises:

| Type | Description |
|---|---|
| ValueError | If the provider is not supported |
| ImportError | If the provider module cannot be imported |
Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py
register_provider(name, provider_class)
classmethod
Register a provider implementation with the factory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Provider name (lowercase) | required |
| provider_class | Callable[..., TranscriptionService] | Provider implementation class or factory function | required |

Example:

    from my_module import MyTranscriptionService
    TranscriptionServiceFactory.register_provider("my_provider", MyTranscriptionService)
Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py
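The register/create pattern above can be sketched with a minimal registry. ServiceFactory and DummyService are illustrative stand-ins, not the package's classes:

```python
from typing import Callable, Dict

class ServiceFactory:
    """Minimal registry-based factory illustrating register/create."""
    _providers: Dict[str, Callable[..., object]] = {}

    @classmethod
    def register_provider(cls, name: str, provider: Callable[..., object]) -> None:
        cls._providers[name.lower()] = provider

    @classmethod
    def create_service(cls, provider: str = "assemblyai", **kwargs) -> object:
        try:
            factory = cls._providers[provider.lower()]
        except KeyError:
            raise ValueError(f"Unsupported provider: {provider!r}") from None
        return factory(**kwargs)

class DummyService:
    def __init__(self, api_key=None):
        self.api_key = api_key

ServiceFactory.register_provider("dummy", DummyService)
svc = ServiceFactory.create_service("dummy", api_key="k")
```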
Utterance
Bases: BaseModel
Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py
confidence
instance-attribute
end_ms
instance-attribute
speaker
instance-attribute
start_ms
instance-attribute
text
instance-attribute
WordTiming
Bases: BaseModel
Source code in src/tnh_scholar/audio_processing/transcription/transcription_service.py
confidence
instance-attribute
end_ms
instance-attribute
start_ms
instance-attribute
word
instance-attribute
vtt_processor
VTTConfig
Configuration options for WebVTT processing.
Source code in src/tnh_scholar/audio_processing/transcription/vtt_processor.py
include_speaker = include_speaker
instance-attribute
max_chars_per_line = max_chars_per_line
instance-attribute
reindex_entries = reindex_entries
instance-attribute
speaker_format = speaker_format
instance-attribute
timestamp_format = timestamp_format
instance-attribute
__init__(include_speaker=False, speaker_format='<v {speaker}>{text}', reindex_entries=False, timestamp_format='{:02d}:{:02d}:{:02d}.{:03d}', max_chars_per_line=42)
Initialize with default settings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| include_speaker | | Whether to include speaker labels in output | False |
| speaker_format | | Format string for speaker attribution | '<v {speaker}>{text}' |
| reindex_entries | | Whether to reindex entries sequentially | False |
| timestamp_format | | Format string for timestamp formatting | '{:02d}:{:02d}:{:02d}.{:03d}' |
| max_chars_per_line | | Maximum characters per line before splitting | 42 |
Source code in src/tnh_scholar/audio_processing/transcription/vtt_processor.py
VTTProcessor
Handles parsing and generating WebVTT format.
Source code in src/tnh_scholar/audio_processing/transcription/vtt_processor.py
config = config or VTTConfig()
instance-attribute
__init__(config=None)
Initialize with optional configuration.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | Optional[VTTConfig] | Configuration options for VTT processing | None |
Source code in src/tnh_scholar/audio_processing/transcription/vtt_processor.py
generate(timed_texts)
Generate VTT content from a list of TimedUnit objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| timed_texts | List[TimedTextUnit] | List of TimedUnit objects | required |

Returns:

| Type | Description |
|---|---|
| str | String containing VTT formatted content |
Source code in src/tnh_scholar/audio_processing/transcription/vtt_processor.py
parse(vtt_content)
Parse VTT content into a list of TimedUnit objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vtt_content | str | String containing VTT formatted content | required |

Returns:

| Type | Description |
|---|---|
| List[TimedTextUnit] | List of TimedUnit objects |
Source code in src/tnh_scholar/audio_processing/transcription/vtt_processor.py
whisper_service
TODO: MAJOR REFACTOR PLANNED
This module currently mixes persistent service configuration (WhisperConfig) with per-call runtime options, which leads to complex validation logic. The plan is to:
- Refactor so each WhisperTranscriptionService instance is configured once at construction, with all relevant settings (including file-like/path-like mode, file extension, etc).
- Use Pydantic BaseSettings for configuration to normalize configuration and validation according to TNH Scholar style.
- Remove ad-hoc runtime options from the transcribe() entrypoint; all config should be set at init.
- If a different configuration is needed, instantiate a new service object.
- This will simplify validation, error handling, and code logic, and make the contract clear and robust.
- NOTE: This will change the TranscriptionService contract and will require similar changes in other transcription system implementations.
- Update all dependent code and tests accordingly.
logger = get_child_logger(__name__)
module-attribute
WhisperBase
Bases: TypedDict
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
duration
instance-attribute
language
instance-attribute
text
instance-attribute
WhisperConfig
dataclass
Configuration for the Whisper transcription service.
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
BASE_PARAMS = ['model', 'language', 'temperature', 'prompt', 'response_format']
class-attribute
instance-attribute
FORMAT_PARAMS = {'verbose_json': ['timestamp_granularities'], 'json': [], 'text': [], 'srt': [], 'vtt': []}
class-attribute
instance-attribute
SUPPORTED_FORMATS = ['json', 'text', 'srt', 'vtt', 'verbose_json']
class-attribute
instance-attribute
chunking_strategy = 'auto'
class-attribute
instance-attribute
language = None
class-attribute
instance-attribute
model = 'whisper-1'
class-attribute
instance-attribute
prompt = None
class-attribute
instance-attribute
response_format = 'verbose_json'
class-attribute
instance-attribute
temperature = None
class-attribute
instance-attribute
timestamp_granularities = field(default_factory=(lambda: ['word']))
class-attribute
instance-attribute
__init__(model='whisper-1', response_format='verbose_json', timestamp_granularities=(lambda: ['word'])(), chunking_strategy='auto', language=None, temperature=None, prompt=None)
to_dict()
Convert configuration to dictionary for API call.
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
validate()
Validate configuration values.
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
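BASE_PARAMS and FORMAT_PARAMS suggest how to_dict() can filter options by response format before the API call. A hypothetical sketch of that filtering (config_to_dict is not the actual method):

```python
BASE_PARAMS = ["model", "language", "temperature", "prompt", "response_format"]
FORMAT_PARAMS = {"verbose_json": ["timestamp_granularities"],
                 "json": [], "text": [], "srt": [], "vtt": []}

def config_to_dict(config: dict) -> dict:
    """Keep base params plus extras allowed for the chosen response_format,
    dropping unset (None) values."""
    fmt = config.get("response_format", "verbose_json")
    allowed = BASE_PARAMS + FORMAT_PARAMS.get(fmt, [])
    return {k: v for k, v in config.items() if k in allowed and v is not None}

# 'srt' allows no extra params, so timestamp_granularities is dropped.
payload = config_to_dict({"model": "whisper-1", "response_format": "srt",
                          "temperature": None,
                          "timestamp_granularities": ["word"]})
```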
WhisperResponse
Bases: WhisperBase
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
segments
instance-attribute
words
instance-attribute
WhisperSegment
Bases: TypedDict
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
avg_logprob
instance-attribute
compression_ratio
instance-attribute
end
instance-attribute
id
instance-attribute
no_speech_prob
instance-attribute
start
instance-attribute
temperature
instance-attribute
text
instance-attribute
WhisperTranscriptionService
Bases: TranscriptionService
OpenAI Whisper implementation of the TranscriptionService interface.
Provides transcription services using the OpenAI Whisper API.
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
config = WhisperConfig()
instance-attribute
format_converter = FormatConverter()
instance-attribute
__init__(api_key=None, **config_options)
Initialize the Whisper transcription service.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | Optional[str] | OpenAI API key (defaults to OPENAI_API_KEY env var) | None |
| **config_options | | Additional configuration options | {} |
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
get_result(job_id)
Get results for an existing transcription job.
Whisper API operates synchronously and doesn't use job IDs, so this method is not implemented.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| job_id | str | ID of the transcription job | required |

Returns:

| Type | Description |
|---|---|
| TranscriptionResult | Dictionary containing transcription results |

Raises:

| Type | Description |
|---|---|
| NotImplementedError | This method is not supported for Whisper |
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
set_api_key(api_key=None)
Set or update the API key.
This method allows refreshing the API key without re-instantiating the class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | Optional[str] | OpenAI API key (defaults to OPENAI_API_KEY env var) | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If no API key is provided or found in environment |
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
transcribe(audio_file, options=None)
Transcribe audio file to text using OpenAI Whisper API.
PATCH: If audio_file is a file-like object, options['file_extension'] must be provided (OpenAI API quirk).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| audio_file | Union[Path, BytesIO] | Path to audio file or file-like object | required |
| options | Optional[Dict[str, Any]] | Provider-specific options for transcription. If audio_file is file-like, must include 'file_extension'. | None |

Returns:

| Type | Description |
|---|---|
| TranscriptionResult | Dictionary containing transcription results with standardized keys |

Raises:

| Type | Description |
|---|---|
| ValueError | If file-like object is provided without 'file_extension' in options |
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
transcribe_to_format(audio_file, format_type='srt', transcription_options=None, format_options=None)
Transcribe audio and return result in specified format.
PATCH: If audio_file is a file-like object, transcription_options['file_extension'] must be provided (OpenAI API quirk).
Takes advantage of the direct subtitle generation functionality when requesting SRT or VTT formats.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `audio_file` | `Union[Path, BytesIO]` | Path, file-like object, or URL of audio file | *required* |
| `format_type` | `str` | Format type (e.g., "srt", "vtt", "text") | `'srt'` |
| `transcription_options` | `Optional[Dict[str, Any]]` | Options for transcription. If `audio_file` is file-like, must include `'file_extension'`. | `None` |
| `format_options` | `Optional[Dict[str, Any]]` | Format-specific options | `None` |
Returns:
| Type | Description |
|---|---|
| `str` | String representation in the requested format |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If a file-like object is provided without `'file_extension'` in `transcription_options` |
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
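When `format_type` is `"srt"` or `"vtt"`, the returned string consists of timestamped cues. For reference, SRT timestamps use the `HH:MM:SS,mmm` layout, which can be produced like this (a standalone sketch, not a function exposed by the package):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1_000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"
```

For example, `srt_timestamp(3661.5)` yields `"01:01:01,500"`; VTT uses the same layout with a dot instead of a comma.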
WordEntry
Bases: TypedDict
Source code in src/tnh_scholar/audio_processing/transcription/whisper_service.py
end
instance-attribute
start
instance-attribute
word
instance-attribute
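Based on the attributes listed, `WordEntry` is a `TypedDict` carrying word-level timing. A sketch with assumed field types (the source types are not shown on this page; seconds-as-float is an assumption):

```python
from typing import TypedDict


class WordEntry(TypedDict):
    """Word-level timing entry (field types assumed: start/end in seconds)."""
    word: str
    start: float
    end: float


# A WordEntry is constructed as a plain dict that satisfies the schema.
entry: WordEntry = {"word": "breathe", "start": 1.2, "end": 1.8}
```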
utils
__all__ = ['AudioEnhancer', 'get_segment_audio', 'play_audio_segment', 'play_bytes', 'play_from_file', 'play_diarization_segment', 'get_audio_from_file']
module-attribute
AudioEnhancer
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
compression_settings = compression_settings
instance-attribute
config = config
instance-attribute
__init__(config=EnhancementConfig(), compression_settings=CompressionSettings())
Initialize with enhancement configuration and compression settings.
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
enhance(input_path, output_path=None)
Apply enhancement routines (compression, EQ, gating, etc.) in a modular fashion. Converts input to FLAC working format for Whisper compatibility.
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
extract_sample(input_path, start, duration, output_path=None, output_format='flac', codec=None, compression_level=8)
Extract a sample segment from the audio file.
Parameters
input_path : Path
    Path to the input audio file.
start : float
    Start time in seconds.
duration : float
    Duration in seconds.
output_path : Path, optional
    Output file path. If None, auto-generated from input.
output_format : str, default="flac"
    Output audio format/extension.
codec : str, optional
    Audio codec to use (default: "flac" if output_format is "flac", else None).
compression_level : int, default=8
    Compression level for supported codecs.
Returns
Path Path to the extracted audio sample.
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
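One way such an extraction maps onto an ffmpeg invocation is sketched below. This is a hypothetical command assembly for illustration, not the package's actual implementation (which may use SoX or different flags):

```python
from pathlib import Path
from typing import List, Optional


def build_extract_command(
    input_path: Path,
    start: float,
    duration: float,
    output_path: Path,
    codec: Optional[str] = "flac",
    compression_level: int = 8,
) -> List[str]:
    """Assemble an ffmpeg command that cuts [start, start+duration) to output_path."""
    cmd = ["ffmpeg", "-y", "-ss", str(start), "-t", str(duration), "-i", str(input_path)]
    if codec:
        cmd += ["-c:a", codec, "-compression_level", str(compression_level)]
    return cmd + [str(output_path)]
```

The resulting list would typically be handed to `subprocess.run(cmd, check=True)`.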
get_audio_info(file_path)
Get detailed audio information using ffprobe.
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
play_audio(file_path)
Play audio in notebook for quality assessment.
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
get_audio_from_file(audio_file)
Source code in src/tnh_scholar/audio_processing/utils/playback.py
get_segment_audio(segment, audio)
Source code in src/tnh_scholar/audio_processing/utils/playback.py
play_audio_segment(audio)
Source code in src/tnh_scholar/audio_processing/utils/playback.py
play_bytes(data, format='wav')
Source code in src/tnh_scholar/audio_processing/utils/playback.py
play_diarization_segment(segment, audio)
Source code in src/tnh_scholar/audio_processing/utils/playback.py
play_from_file(path)
Source code in src/tnh_scholar/audio_processing/utils/playback.py
audio_enhance
Module review and recommendations:

Big Picture Approach:

- Modular, configurable, and extensible: your use of Pydantic models for settings and configs is excellent. It makes the pipeline flexible and easy to tune for different ASR or enhancement needs.
- Tooling: leveraging SoX and FFmpeg is a pragmatic choice for robust, high-quality audio processing.
- Pipeline structure: the AudioEnhancer class is well-structured, with clear separation of concerns for each processing step (remix, rate, gain, EQ, compand, etc.).
- Notebook integration: the play_audio method and use of IPython display are great for interactive, iterative work.

Details & Points You Might Be Missing:

- Error handling & logging: errors are printed but would benefit from more structured logging (e.g., Python's logging module) and more granular exception handling, especially around subprocess calls.
- Testing & validation: no unit tests or validation of output audio quality/format are present. Consider automated tests (even just for file existence, format, and basic properties), plus a method to compare pre/post enhancement SNR, loudness, or other metrics.
- Documentation & examples: while docstrings are good, a usage example (in code or markdown) would help new users. Consider a README or notebook cell that demonstrates a full workflow.
- Performance: for large-scale or batch processing, consider parallelization or async processing. Temporary files (e.g., intermediate FLACs) could be managed and cleaned up more robustly.
- Extensibility: the pipeline is modular, but adding a "custom steps" hook (e.g., user-defined SoX/FFmpeg args) would make it even more flexible, as would support for other codecs or output formats for downstream ASR models.
- Feature gaps: the extract_sample method is a TODO; implementing it would be useful for quick QA or dataset creation. Consider adding Voice Activity Detection (VAD) or silence trimming as optional steps, and a "dry run" mode that prints the SoX/FFmpeg commands without executing them, for debugging.
- ASR-specific enhancements: preset configs for different ASR models (e.g., Whisper, Wav2Vec2) may help, as each can have different optimal preprocessing. Consider integrating with open-source ASR evaluation tools to close the loop on enhancement effectiveness.

General Strategic Recommendations:

- Automate QA: add methods to check output audio quality, duration, and format, and optionally compare to input.
- Batch processing: add a method to process a directory or list of files.
- Config export/import: allow saving/loading configs as JSON/YAML for reproducibility.
- CLI/script interface: consider a command-line interface for use outside notebooks.
- Unit tests: add basic tests for each method, especially for error cases.

Summary Table:

| Aspect | Status | Recommendation |
|---|---|---|
| Modularity | Good | Add custom step hooks |
| Configurability | Excellent | Presets for more ASR models |
| Error Handling | Basic | Use logging, more granular exceptions |
| Testing | Missing | Add unit tests, output validation |
| Documentation | Good | Add usage examples, README |
| Extensibility | Good | Support more codecs, batch processing |
| ASR Optimization | Good start | Add VAD, silence trim, model-specific configs |
logger = get_child_logger(__name__)
module-attribute
AudioEnhancer
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
compression_settings = compression_settings
instance-attribute
config = config
instance-attribute
__init__(config=EnhancementConfig(), compression_settings=CompressionSettings())
Initialize with enhancement configuration and compression settings.
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
enhance(input_path, output_path=None)
Apply enhancement routines (compression, EQ, gating, etc.) in a modular fashion. Converts input to FLAC working format for Whisper compatibility.
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
extract_sample(input_path, start, duration, output_path=None, output_format='flac', codec=None, compression_level=8)
Extract a sample segment from the audio file.
Parameters
input_path : Path
    Path to the input audio file.
start : float
    Start time in seconds.
duration : float
    Duration in seconds.
output_path : Path, optional
    Output file path. If None, auto-generated from input.
output_format : str, default="flac"
    Output audio format/extension.
codec : str, optional
    Audio codec to use (default: "flac" if output_format is "flac", else None).
compression_level : int, default=8
    Compression level for supported codecs.
Returns
Path Path to the extracted audio sample.
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
get_audio_info(file_path)
Get detailed audio information using ffprobe.
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
play_audio(file_path)
Play audio in notebook for quality assessment.
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
CompressionSettings
Bases: BaseSettings
Compression settings for audio enhancement routines.
Attributes:
| Name | Type | Description |
|---|---|---|
| `minimal` | `list[str]` | List of compand arguments for minimal compression. |
| `light` | `list[str]` | List of compand arguments for light compression. |
| `moderate` | `list[str]` | List of compand arguments for moderate compression. |
| `aggressive` | `list[str]` | List of compand arguments for aggressive compression. |
| `whisper_optimized` | `list[str]` | List of compand arguments for Whisper-optimized compression. |
| `whisper_aggressive` | `list[str]` | List of compand arguments for aggressive Whisper compression. |
| `primary_speech_only` | `list[str]` | List of compand arguments for primary speech only. |
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
aggressive = ['0.02,0.1', '8:-70,-55,-45,-35,-25,-15', '-5', '-90', '0.05']
class-attribute
instance-attribute
light = ['0.05,0.2', '6:-60,-50,-40,-30,-20,-10', '-3', '-85', '0.1']
class-attribute
instance-attribute
minimal = ['0.1,0.3', '3:-50,-40,-30,-20', '-3', '-80', '0.2']
class-attribute
instance-attribute
moderate = ['0.03,0.15', '6:-65,-50,-40,-30,-20,-10', '-4', '-85', '0.1']
class-attribute
instance-attribute
primary_speech_only = ['0.005,0.06', '12:-60,-45,-55,-30,-35,-18,-15,-8', '-8', '-60', '0.03']
class-attribute
instance-attribute
whisper_aggressive = ['0.005,0.06', '12:-75,-45,-55,-30,-35,-18,-15,-8', '-8', '-95', '0.03']
class-attribute
instance-attribute
whisper_optimized = ['0.005,0.06', '12:-75,-65,-55,-45,-35,-25,-15,-8', '-8', '-95', '0.03']
class-attribute
instance-attribute
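Each preset above is a SoX `compand` argument list: an attack,decay pair, a transfer function (optional soft-knee dB before the colon, then in-dB,out-dB points), followed by gain, initial volume, and delay. A small parser, sketched under that reading of the compand syntax (field names are illustrative, not package API):

```python
from typing import Dict, List, Union


def parse_compand(args: List[str]) -> Dict[str, Union[float, str]]:
    """Split a SoX compand argument list into named fields."""
    attack_decay, transfer, gain, initial_volume, delay = args
    attack, decay = (float(x) for x in attack_decay.split(","))
    return {
        "attack": attack,            # seconds to react to rising level
        "decay": decay,              # seconds to react to falling level
        "transfer": transfer,        # e.g. '3:-50,-40,-30,-20' (knee:in,out points)
        "gain": float(gain),         # post-compression gain in dB
        "initial_volume": float(initial_volume),  # assumed starting level in dB
        "delay": float(delay),       # look-ahead delay in seconds
    }


minimal = parse_compand(["0.1,0.3", "3:-50,-40,-30,-20", "-3", "-80", "0.2"])
```

Comparing presets this way makes the trade-off visible: the Whisper-oriented presets use much faster attack/decay times and a denser transfer curve than `minimal`.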
EQSettings
Bases: BaseSettings
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
bass = (-5, 200)
class-attribute
instance-attribute
contrast = 75
class-attribute
instance-attribute
eq_bands = [(100, 0.9, -20), (1500, 1, 4), (4000, 0.6, 15), (10000, 1, -10)]
class-attribute
instance-attribute
highpass_freq = 175
class-attribute
instance-attribute
lowpass_freq = 15000
class-attribute
instance-attribute
treble = (3, 3000)
class-attribute
instance-attribute
EnhancementConfig
Bases: BaseModel
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
channels = 2
class-attribute
instance-attribute
codec = 'flac'
class-attribute
instance-attribute
compression_level = 'aggressive'
class-attribute
instance-attribute
eq = EQSettings()
class-attribute
instance-attribute
force_mono = False
class-attribute
instance-attribute
gate = GateSettings()
class-attribute
instance-attribute
include_eq = True
class-attribute
instance-attribute
include_gate = True
class-attribute
instance-attribute
norm = NormalizationSettings()
class-attribute
instance-attribute
rate = RateSettings()
class-attribute
instance-attribute
remix = RemixSettings()
class-attribute
instance-attribute
sample_rate = 48000
class-attribute
instance-attribute
target_rate = None
class-attribute
instance-attribute
GateSettings
Bases: BaseSettings
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
gate_params = ['0.1', '0.05', '-inf', '0.1', '-90', '0.1']
class-attribute
instance-attribute
NormalizationSettings
Bases: BaseSettings
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
norm_level = -1
class-attribute
instance-attribute
RateSettings
Bases: BaseSettings
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
rate_args = ['-v']
class-attribute
instance-attribute
RemixSettings
Bases: BaseSettings
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
remix_channels = '1,2'
class-attribute
instance-attribute
compress_wav_to_mp4_vbr(input_wav, output_path=None, quality=8)
Compress WAV to M4A (AAC VBR) using ffmpeg.
Parameters:
input_wav : str or Path
    Path to the input .wav file.
output_path : str or Path, optional
    Output .mp4 file path. If None, auto-generated from input.
quality : int, default=8
    VBR quality level: 1 = good (~96kbps), 2 = very good (~128kbps), 3+ = higher bitrate.
Returns:
Path Path to the compressed .m4a file
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
get_sox_info(file_path)
Get audio info using SoX
Source code in src/tnh_scholar/audio_processing/utils/audio_enhance.py
playback
get_audio_from_file(audio_file)
Source code in src/tnh_scholar/audio_processing/utils/playback.py
get_segment_audio(segment, audio)
Source code in src/tnh_scholar/audio_processing/utils/playback.py
play_audio_segment(audio)
Source code in src/tnh_scholar/audio_processing/utils/playback.py
play_bytes(data, format='wav')
Source code in src/tnh_scholar/audio_processing/utils/playback.py
play_diarization_segment(segment, audio)
Source code in src/tnh_scholar/audio_processing/utils/playback.py
play_from_file(path)
Source code in src/tnh_scholar/audio_processing/utils/playback.py
whisper_security
logger = get_child_logger(__name__)
module-attribute
load_whisper_model(model_name)
Safely load a Whisper model with security best practices.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | Name of the Whisper model to load (e.g., "tiny", "base", "small") | *required* |
Returns:
| Type | Description |
|---|---|
| `Any` | Loaded Whisper model |
Raises:
| Type | Description |
|---|---|
| `RuntimeError` | If model loading fails |
Source code in src/tnh_scholar/audio_processing/whisper_security.py
safe_torch_load(weights_only=True)
Context manager that temporarily modifies torch.load to use weights_only=True by default.
This addresses the FutureWarning in PyTorch regarding pickle security: https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `weights_only` | `bool` | If True, limits unpickling to tensor data only. | `True` |
Yields:
| Type | Description |
|---|---|
| `None` | None |
Example
>>> with safe_torch_load():
...     model = whisper.load_model("tiny")
Source code in src/tnh_scholar/audio_processing/whisper_security.py
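The underlying technique (temporarily wrapping a callable so a keyword default is pinned, then restoring it on exit) generalizes beyond `torch.load`. A self-contained sketch of the pattern, demonstrated on a stand-in object so it runs without PyTorch:

```python
import contextlib
import functools
import types


@contextlib.contextmanager
def patched_default(obj, attr: str, **defaults):
    """Temporarily wrap obj.attr so the given keyword defaults are applied."""
    original = getattr(obj, attr)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        for key, value in defaults.items():
            kwargs.setdefault(key, value)  # caller-supplied values still win
        return original(*args, **kwargs)

    setattr(obj, attr, wrapper)
    try:
        yield
    finally:
        setattr(obj, attr, original)  # always restore, even on error


# Stand-in for a module with a load() function (torch.load is patched the same way).
fake_torch = types.SimpleNamespace(load=lambda path, weights_only=False: weights_only)
with patched_default(fake_torch, "load", weights_only=True):
    assert fake_torch.load("model.pt") is True   # default now applied
assert fake_torch.load("model.pt") is False      # original restored
```

The `try`/`finally` is what makes the restoration safe even when the body raises.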
cli_tools
TNH Scholar CLI Tools
Command-line interface tools for the TNH Scholar project:
audio-transcribe:
Audio processing pipeline that handles downloading, segmentation,
and transcription of Buddhist teachings.
tnh-fab:
Text processing tool providing functionality for punctuation,
sectioning, translation, and pattern-based processing.
See individual tool documentation for usage details and examples.
audio_transcribe
audio_transcribe
CLI tool for acquiring audio (YouTube download or local file) and transcribing it to text.
Usage
audio-transcribe [OPTIONS]
e.g. audio-transcribe --yt_url https://www.youtube.com/watch?v=EXAMPLE --output_dir ./processed --service whisper --model whisper-1
DEFAULT_CHUNK_DURATION = 120
module-attribute
DEFAULT_MIN_CHUNK = 10
module-attribute
DEFAULT_MODEL = 'whisper-1'
module-attribute
DEFAULT_OUTPUT_PATH = './audio_transcriptions/transcript.txt'
module-attribute
DEFAULT_RESPONSE_FORMAT = 'text'
module-attribute
DEFAULT_SERVICE = 'whisper'
module-attribute
DEFAULT_TEMP_DIR = tempfile.gettempdir()
module-attribute
VIDEO_EXTENSIONS = {'.mp4', '.avi', '.mov', '.mkv', '.wmv'}
module-attribute
logger = get_child_logger(__name__)
module-attribute
AudioTranscribeApp
Main application class for audio transcription CLI.
Organizes configuration, source resolution, and pipeline execution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `yt_url` | | YouTube URL to download audio from. | *required* |
| `yt_url_csv` | | CSV file containing YouTube URLs. | *required* |
| `file_` | | Path to local audio file. | *required* |
| `output_dir` | | Directory for output files. | *required* |
| `service` | | Transcription service provider. | *required* |
| `model` | | Transcription model name. | *required* |
| `language` | | Language code for transcription. | *required* |
| `response_format` | | Format of transcription response. | *required* |
| `chunk_duration` | | Target chunk duration (seconds). | *required* |
| `min_chunk` | | Minimum chunk duration (seconds). | *required* |
| `start_time` | | Start time offset (HH:MM:SS). | *required* |
| `end_time` | | End time offset (HH:MM:SS). | *required* |
| `prompt` | | Prompt or keywords for transcription. | *required* |
Source code in src/tnh_scholar/cli_tools/audio_transcribe/audio_transcribe.py
audio_file = self._resolve_audio_source()
instance-attribute
chunk_duration = TimeMs.from_seconds(config.chunk_duration)
instance-attribute
config = config
instance-attribute
diarization_config = self._build_diarization_config()
instance-attribute
end_time = config.end_time
instance-attribute
file_ = config.file_
instance-attribute
keep_artifacts = config.keep_artifacts
instance-attribute
language = config.language
instance-attribute
min_chunk = TimeMs.from_seconds(config.min_chunk)
instance-attribute
model = config.model
instance-attribute
output_path = Path(config.output)
instance-attribute
prompt = config.prompt
instance-attribute
response_format = config.response_format
instance-attribute
service = config.service
instance-attribute
start_time = config.start_time
instance-attribute
temp_dir = self.output_path.parent
instance-attribute
transcription_options = self._build_transcription_options()
instance-attribute
yt_url = config.yt_url
instance-attribute
yt_url_csv = config.yt_url_csv
instance-attribute
__init__(config)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `AudioTranscribeConfig` | Validated AudioTranscribeConfig instance. | *required* |
Source code in src/tnh_scholar/cli_tools/audio_transcribe/audio_transcribe.py
run()
Run the transcription pipeline and print results, or just download audio if no_transcribe is set.
Source code in src/tnh_scholar/cli_tools/audio_transcribe/audio_transcribe.py
audio_transcribe(**kwargs)
CLI entry point for audio transcription.
Source code in src/tnh_scholar/cli_tools/audio_transcribe/audio_transcribe.py
main()
Source code in src/tnh_scholar/cli_tools/audio_transcribe/audio_transcribe.py
config
DEFAULT_OUTPUT_PATH = './audio_transcriptions/transcript.txt'
module-attribute
DEFAULT_SERVICE = 'whisper'
module-attribute
DEFAULT_TEMP_DIR = './audio_transcriptions/tmp'
module-attribute
AudioTranscribeConfig
Bases: BaseSettings
Source code in src/tnh_scholar/cli_tools/audio_transcribe/config.py
chunk_duration = Field(description='Target chunk duration in seconds')
class-attribute
instance-attribute
end_time = Field(default=None, description='End time offset')
class-attribute
instance-attribute
file_ = Field(default=None, description='Path to local audio file')
class-attribute
instance-attribute
keep_artifacts = Field(default=False, description='Keep all intermediate artifacts in the output directory instead of using a system temp directory.')
class-attribute
instance-attribute
language = Field(default='en', description='Language code')
class-attribute
instance-attribute
min_chunk = Field(ge=10, description='Minimum chunk duration in seconds')
class-attribute
instance-attribute
model = Field(description='Transcription model name')
class-attribute
instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', extra='ignore')
class-attribute
instance-attribute
no_transcribe = Field(default=False, description='If True, only download YouTube audio to mp3, no transcription.')
class-attribute
instance-attribute
output = Field(default=DEFAULT_OUTPUT_PATH, description='Path to output transcript file')
class-attribute
instance-attribute
prompt = Field(default='', description='Prompt or keywords')
class-attribute
instance-attribute
response_format = Field(description='Response format')
class-attribute
instance-attribute
service = Field(default=DEFAULT_SERVICE, pattern='^(whisper|assemblyai)$', description='Transcription service')
class-attribute
instance-attribute
start_time = Field(default=None, description='Start time offset')
class-attribute
instance-attribute
temp_dir = Field(default=None, description='Directory for temporary processing files')
class-attribute
instance-attribute
yt_url = Field(default=None, description='YouTube URL')
class-attribute
instance-attribute
yt_url_csv = Field(default=None, description='CSV file with YouTube URLs')
class-attribute
instance-attribute
validate_sources()
Source code in src/tnh_scholar/cli_tools/audio_transcribe/config.py
MultipleAudioSourceError
Bases: ValueError
Raised when more than one audio source is provided.
Source code in src/tnh_scholar/cli_tools/audio_transcribe/config.py
NoAudioSourceError
Bases: ValueError
Raised when no audio source is provided.
Source code in src/tnh_scholar/cli_tools/audio_transcribe/config.py
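The exclusivity rule behind these two exceptions can be sketched as follows. The exception classes are named as documented above; the checking function itself is an assumption about what `validate_sources` enforces, not the package's code:

```python
class NoAudioSourceError(ValueError):
    """Raised when no audio source is provided."""


class MultipleAudioSourceError(ValueError):
    """Raised when more than one audio source is provided."""


def check_single_source(yt_url=None, yt_url_csv=None, file_=None):
    """Require exactly one of the three mutually exclusive audio sources."""
    provided = [s for s in (yt_url, yt_url_csv, file_) if s is not None]
    if not provided:
        raise NoAudioSourceError("one of yt_url, yt_url_csv, or file_ is required")
    if len(provided) > 1:
        raise MultipleAudioSourceError("only one audio source may be given")
    return provided[0]
```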
convert_video
FFMPEG_VIDEO_CONV_DEFAULT_CONFIG = {'audio_codec': 'libmp3lame', 'audio_bitrate': '192k', 'audio_samplerate': '44100'}
module-attribute
logger = get_child_logger(__name__)
module-attribute
convert_video_to_audio(video_file, output_dir, conversion_params=None)
Convert a video file to an audio file using ffmpeg.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `video_file` | `Path` | Path to the video file | *required* |
| `output_dir` | `Path` | Directory to save the converted audio file | *required* |
| `conversion_params` | `Optional[Dict[str, str]]` | Optional dictionary to override default conversion parameters | `None` |
Returns:
| Type | Description |
|---|---|
| `Path` | Path to the converted audio file |
Source code in src/tnh_scholar/cli_tools/audio_transcribe/convert_video.py
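The override semantics of `conversion_params` can be illustrated with a dict merge over the documented defaults (the merge helper here is a sketch, not the package function):

```python
from typing import Dict, Optional

FFMPEG_VIDEO_CONV_DEFAULT_CONFIG = {
    "audio_codec": "libmp3lame",
    "audio_bitrate": "192k",
    "audio_samplerate": "44100",
}


def merged_conversion_params(overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Later keys win: any override replaces the corresponding default."""
    return {**FFMPEG_VIDEO_CONV_DEFAULT_CONFIG, **(overrides or {})}
```

For example, passing `{"audio_bitrate": "128k"}` keeps the default codec and sample rate but lowers the bitrate.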
environment
env
logger = get_child_logger(__name__)
module-attribute
check_env()
Check the environment for necessary conditions:
1. Check the OpenAI key is available.
2. Check that all requirements from requirements.txt are importable.
Source code in src/tnh_scholar/cli_tools/audio_transcribe/environment/env.py
check_requirements(requirements_file)
Check that all requirements listed in requirements.txt can be imported. If any cannot be imported, print a warning.
This is a heuristic check. Some packages may not share the same name as their importable module. Adjust the name mappings below as needed.
Example
check_requirements(Path("./requirements.txt"))
Prints warnings if imports fail, otherwise silent.
Source code in src/tnh_scholar/cli_tools/audio_transcribe/environment/env.py
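The heuristic described above (package names don't always match import names) can be sketched like this; the name-map entries are illustrative examples of PyPI packages whose import name differs, not the module's actual mapping:

```python
import importlib
from typing import Dict, List, Optional

# PyPI name -> import name, for packages where the two differ (illustrative).
NAME_MAP: Dict[str, str] = {
    "pyyaml": "yaml",
    "pillow": "PIL",
    "beautifulsoup4": "bs4",
}


def find_unimportable(
    requirements: List[str], name_map: Optional[Dict[str, str]] = None
) -> List[str]:
    """Return requirement names that cannot currently be imported."""
    name_map = name_map or NAME_MAP
    missing = []
    for req in requirements:
        # Strip version pins like 'click==8.1' or 'pydantic>=2'.
        pkg = req.split("==")[0].split(">=")[0].strip().lower()
        module = name_map.get(pkg, pkg.replace("-", "_"))
        try:
            importlib.import_module(module)
        except ImportError:
            missing.append(pkg)
    return missing
```

As the docstring notes, this stays a heuristic: any package whose import name is not in the map and does not follow the dash-to-underscore convention will be misreported.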
transcription_pipeline
TranscriptionPipeline
Source code in src/tnh_scholar/cli_tools/audio_transcribe/transcription_pipeline.py
audio_file = audio_file
instance-attribute
audio_file_extension = audio_file.suffix
instance-attribute
diarization_config = diarization_config or DiarizationConfig()
instance-attribute
diarization_dir = self.output_dir / f'{self.audio_file.stem}_diarization'
instance-attribute
diarization_kwargs = diarization_kwargs or {}
instance-attribute
diarization_results_path = self.diarization_dir / 'raw_diarization_results.json'
instance-attribute
logger = logger or logging.getLogger(__name__)
instance-attribute
output_dir = output_dir
instance-attribute
save_diarization = save_diarization
instance-attribute
transcriber = transcriber
instance-attribute
transcription_options = patch_whisper_options(transcription_options, file_extension=(audio_file.suffix))
instance-attribute
__init__(audio_file, output_dir, diarization_config=None, transcriber='whisper', transcription_options=None, diarization_kwargs=None, save_diarization=True, logger=None)
Initialize the TranscriptionPipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `audio_file` | `Path` | Path to the audio file to process. | *required* |
| `output_dir` | `Path` | Directory to store output files. | *required* |
| `diarization_config` | `Optional[DiarizationConfig]` | Diarization configuration. | `None` |
| `transcriber` | `str` | Transcription service provider. | `'whisper'` |
| `transcription_options` | `Optional[Dict[str, Any]]` | Options for transcription. | `None` |
| `diarization_kwargs` | `Optional[Dict[str, Any]]` | Additional diarization arguments. | `None` |
| `save_diarization_json` | `bool` | Whether to save raw diarization JSON results. | *required* |
| `logger` | `Optional[Logger]` | Logger for pipeline events. | `None` |
Source code in src/tnh_scholar/cli_tools/audio_transcribe/transcription_pipeline.py
run()
Execute the full transcription pipeline with robust error handling.
Returns:
| Type | Description |
|---|---|
| `Optional[List[Dict[str, Any]]]` | List of transcript dicts with chunk metadata, or None on failure |
Raises:
| Type | Description |
|---|---|
| `RuntimeError` | If any pipeline step fails. |
Source code in src/tnh_scholar/cli_tools/audio_transcribe/transcription_pipeline.py
validate
validate_inputs(is_download, yt_url, yt_url_list, audio_file, split, transcribe, chunk_dir, no_chunks, silence_boundaries, whisper_boundaries)
Validate the CLI inputs for logical consistency across all flags.

Conditions and requirements:

1. At least one action (`yt_download`, `split`, `transcribe`) must be requested; otherwise there is nothing to do and an error is raised.
2. If `yt_download` is True, exactly one of `yt_process_url` or `yt_process_url_list` must be specified (not both, not neither).
3. If `yt_download` is False and `split` is requested, a local audio file is required, since no download will occur.
4. If `transcribe` is requested without `split` and without `yt_download`:
   - If `no_chunks` is False, `chunk_dir` must be provided so existing chunks can be read.
   - If `no_chunks` is True, a local audio file is required for direct transcription (with `yt_download` False, the previously-downloaded-file scenario does not apply).
5. The `no_chunks` flag requests direct transcription of the entire audio without chunking:
   - `split` cannot be combined with `no_chunks`; they are mutually exclusive.
   - `chunk_dir` is meaningless when `no_chunks` is True; providing it raises an error rather than being silently ignored, to prevent user confusion.
6. The boundary flags (`silence_boundaries`, `whisper_boundaries`) control how splitting is done:
   - If `split` is False, specifying either flag raises an error to avoid a confusing no-op.
   - If `split` is True, exactly one method must be selected. `whisper_boundaries` defaults to True. If both flags are True, an error is raised; if both are False (the user disabled the whisper default without enabling silence detection), an error is also raised so the splitting method is always deterministic.

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the input arguments are not logically consistent. |

Source code in `src/tnh_scholar/cli_tools/audio_transcribe/validate.py`, lines 3–136.
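The boundary-flag rules in item 6 can be expressed compactly. This is a hypothetical sketch of the rule set described above, not the shipped validator:

```python
def check_boundary_flags(split: bool, no_chunks: bool,
                         silence_boundaries: bool, whisper_boundaries: bool) -> str:
    """Return the chosen splitting method, or raise ValueError on inconsistency."""
    if no_chunks and split:
        raise ValueError("--split and --no_chunks are mutually exclusive")
    if not split and (silence_boundaries or whisper_boundaries):
        raise ValueError("boundary flags are only meaningful with --split")
    if split:
        if silence_boundaries and whisper_boundaries:
            raise ValueError("choose exactly one boundary method")
        if not (silence_boundaries or whisper_boundaries):
            raise ValueError("no boundary method selected")
        return "silence" if silence_boundaries else "whisper"
    return "none"  # not splitting, so no method applies
```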
version_check
logger = get_child_logger(__name__)
module-attribute
YTDVersionChecker
Simple version checker for yt-dlp with robust version comparison.

This is a prototype implementation that may need expansion in these areas:

- Caching to prevent frequent PyPI calls
- More comprehensive error handling for:
  - Missing/uninstalled packages
  - Network timeouts
  - JSON parsing errors
  - Invalid version strings
- Environment detection (virtualenv, conda, system Python)
- Configuration options for version pinning
- Proxy support for network requests

Source code in `src/tnh_scholar/cli_tools/audio_transcribe/version_check.py`, lines 12–90.
NETWORK_TIMEOUT = 5
class-attribute
instance-attribute
PYPI_URL = 'https://pypi.org/pypi/yt-dlp/json'
class-attribute
instance-attribute
check_version()
Check if yt-dlp needs updating.

Returns:

| Type | Description |
|---|---|
| `Tuple[bool, Version, Version]` | Tuple of (needs_update, installed_version, latest_version) |

Raises:

| Type | Description |
|---|---|
| `ImportError` | If yt-dlp is not installed |
| `RequestException` | For network-related errors |
| `InvalidVersion` | If version strings are invalid |

Source code in `src/tnh_scholar/cli_tools/audio_transcribe/version_check.py`, lines 74–90.
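The needs_update decision reduces to an ordered version comparison. The checker returns `Version` objects (see the return type above); a naive stand-in using plain integer tuples illustrates the idea for yt-dlp's date-style versions:

```python
def needs_update(installed: str, latest: str) -> bool:
    """Naive version comparison for dotted numeric versions like '2024.3.10'.

    Illustration only: real PEP 440 strings (pre/post-releases, epochs)
    need packaging.version.Version, which this sketch does not handle.
    """
    def parse(v: str):
        return tuple(int(part) for part in v.split("."))

    # Tuple comparison is lexicographic: (2024, 1, 1) < (2024, 3, 10)
    return parse(installed) < parse(latest)
```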
check_ytd_version()
Check if yt-dlp needs updating and log appropriate messages.
This function checks the installed version of yt-dlp against the latest version on PyPI and logs informational or error messages as appropriate. It handles network errors, missing packages, and version parsing issues gracefully.
The function does not raise exceptions but logs them using the application's logging system.
Source code in `src/tnh_scholar/cli_tools/audio_transcribe/version_check.py`, lines 93–124.
json_to_srt
__all__ = ['main', 'json_to_srt']
module-attribute
main()
Entry point for the jsonl-to-srt CLI tool.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 165–167.
json_to_srt
Simple CLI tool for converting JSONL transcription files to SRT format.
This module provides a command line interface for transforming JSONL transcription files (from audio-transcribe) into SRT subtitle format. Handles chunked transcriptions with proper timestamp accumulation.
logger = get_child_logger(__name__)
module-attribute
JsonlToSrtConverter
Converts JSONL transcription files from audio-transcribe to SRT format.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 25–139.
accumulated_time = 0.0
instance-attribute
entry_index = 1
instance-attribute
__init__()
Initialize converter state.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 28–31.
build_srt_entry(index, start, end, text)
Format a single SRT entry.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 49–53.
convert(input_file, output_file=None)
Convert a JSONL transcription file to SRT format.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_file` | `TextIO` | JSONL transcription file to parse | *required* |
| `output_file` | `Optional[Path]` | Optional output file path | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | SRT formatted content |

Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 125–139.
extract_segment_data(segment)
Extract timestamp and text data from a segment.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 55–60.
format_timestamp(seconds)
Convert seconds to SRT timestamp format (HH:MM:SS,mmm).
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 33–39.
get_segments_from_data(data)
Extract segments from a data object.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 83–85.
handle_output(srt_content, output_file)
Write SRT content to file or stdout.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 117–123.
parse_jsonl_line(line)
Parse a single JSONL line into a dictionary.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 41–47.
process_jsonl_content(lines)
Process all JSONL content into SRT format.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 107–115.
process_jsonl_line(line)
Process a single JSONL line into SRT entries.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 91–105.
process_segment(segment)
Process a single segment into SRT format.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 62–71.
process_segments_list(segments_list)
Process a list of segments into SRT entries.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 73–81.
read_input_lines(input_file)
Read and filter input lines from file.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 87–89.
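The timestamp conversion documented above (seconds to HH:MM:SS,mmm) can be illustrated with a minimal stand-alone version; the shipped method may differ in edge-case handling:

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp format HH:MM:SS,mmm."""
    ms = round(seconds * 1000)          # work in whole milliseconds
    h, rem = divmod(ms, 3_600_000)      # hours
    m, rem = divmod(rem, 60_000)        # minutes
    s, ms = divmod(rem, 1000)           # seconds and leftover milliseconds
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


format_timestamp(3661.5)  # → "01:01:01,500"
```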
json_to_srt(input_file, output=None)
Convert JSONL transcription files to SRT subtitle format.
Reads from stdin if no INPUT_FILE is specified. Writes to stdout if no output file is specified.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 142–162.
main()
Entry point for the jsonl-to-srt CLI tool.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt.py`, lines 165–167.
json_to_srt1
Simple CLI tool for converting JSONL transcription files to SRT format.
This module provides a command line interface for transforming JSONL transcription files (from audio-transcribe) into SRT subtitle format.
logger = get_child_logger(__name__)
module-attribute
convert_to_srt(input_file, output_file=None)
Convert a JSONL transcription file to SRT format.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_file` | `TextIO` | JSONL transcription file to parse | *required* |
| `output_file` | `Optional[Path]` | Optional output file path | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | SRT formatted content |

Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 114–128.
extract_segment_data(segment)
Extract timestamp and text data from a segment.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 44–49.
format_srt_entry(index, start, end, text)
Format a single SRT entry.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 38–42.
format_timestamp(seconds)
Convert seconds to SRT timestamp format (HH:MM:SS,mmm).
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 22–28.
get_segments_from_data(data)
Extract segments from a data object.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 73–75.
handle_output(srt_content, output_file)
Write SRT content to file or stdout.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 106–112.
json_to_srt(input_file, output=None)
Convert JSONL transcription files to SRT subtitle format.
Reads from stdin if no INPUT_FILE is specified. Writes to stdout if no output file is specified.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 130–149.
main()
Entry point for the jsonl-to-srt CLI tool.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 151–153.
parse_jsonl_line(line)
Parse a single JSONL line into a dictionary.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 30–36.
process_jsonl_content(lines)
Process all JSONL content into SRT format.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 90–104.
process_jsonl_line(line, entry_index)
Process a single JSONL line into SRT entries.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 81–88.
process_segment(segment, entry_index)
Process a single segment into SRT format.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 51–59.
process_segments_list(segments_list, entry_index)
Process a list of segments into SRT entries.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 61–71.
read_input_lines(input_file)
Read and filter input lines from file.
Source code in `src/tnh_scholar/cli_tools/json_to_srt/json_to_srt1.py`, lines 77–79.
nfmt
nfmt
main()
Entry point for the nfmt CLI tool.
Source code in `src/tnh_scholar/cli_tools/nfmt/nfmt.py`, lines 24–26.
nfmt(input_file, output, spacing)
Normalize the number of newlines in a text file.
Source code in `src/tnh_scholar/cli_tools/nfmt/nfmt.py`, lines 5–21.
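Newline normalization of the kind nfmt performs can be sketched with a single regex substitution. The `spacing` semantics here are an assumption (paragraphs separated by exactly `spacing` newlines); the tool's actual rules may differ:

```python
import re


def normalize_newlines(text: str, spacing: int = 2) -> str:
    """Collapse runs of more than `spacing` consecutive newlines down to `spacing`.

    Sketch only; nfmt's exact normalization rules may differ.
    """
    pattern = r"\n{%d,}" % (spacing + 1)   # runs longer than the target spacing
    return re.sub(pattern, "\n" * spacing, text)
```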
sent_split
sent_split
Simple CLI tool for sentence splitting.
This module provides a command line interface for splitting text into sentences. Uses NLTK for robust sentence tokenization. Reads from stdin and writes to stdout by default, with optional file input/output.
SplitConfig
Bases: BaseModel
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split.py`, lines 26–28.
nltk_tokenizer = 'punkt'
class-attribute
instance-attribute
separator = 'newline'
class-attribute
instance-attribute
SplitIOData
Bases: BaseModel
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split.py`, lines 35–60.
content = None
class-attribute
instance-attribute
input_path = None
class-attribute
instance-attribute
output_path = None
class-attribute
instance-attribute
from_io(input_file, output)
classmethod
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split.py`, lines 40–44.
get_input_content()
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split.py`, lines 46–50.
write_output(result)
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split.py`, lines 52–60.
SplitResult
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split.py`, lines 30–33.
stats = {}
class-attribute
instance-attribute
text_object
instance-attribute
ensure_nltk_data(config)
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split.py`, lines 62–74.
main()
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split.py`, lines 125–126.
sent_split(input_file, output, space)
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split.py`, lines 100–123.
split_text(text, config, io_data)
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split.py`, lines 76–98.
sent_split_bak
Simple CLI tool for sentence splitting.
This module provides a command line interface for splitting text into sentences. Uses NLTK for robust sentence tokenization. Reads from stdin and writes to stdout by default, with optional file input/output.
ensure_nltk_data()
Ensure NLTK punkt tokenizer is available.
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split_bak.py`, lines 28–44.
main()
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split_bak.py`, lines 100–101.
process_text(text, newline=True)
Split text into sentences using NLTK.
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split_bak.py`, lines 46–52.
sent_split(input_file, output, space)
Split text into sentences using NLTK's sentence tokenizer.
Reads from stdin if no input file is specified. Writes to stdout if no output file is specified.
Source code in `src/tnh_scholar/cli_tools/sent_split/sent_split_bak.py`, lines 54–98.
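The tool itself uses NLTK's punkt tokenizer, which handles abbreviations and other edge cases. As a dependency-free illustration of the same interface, a deliberately naive regex splitter (named plainly as a stand-in, not what sent_split ships):

```python
import re


def naive_sent_split(text: str) -> list:
    """Split text on sentence-final punctuation followed by whitespace.

    Illustration only: unlike NLTK's punkt tokenizer, this mis-splits
    abbreviations like 'Dr.' or 'e.g.'.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```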
srt_translate
__all__ = ['main', 'srt_translate']
module-attribute
main()
Entry point for the srt-translate CLI tool.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 262–264.
srt_translate
CLI tool for translating SRT subtitle files using tnh-scholar line translation.
This module provides a command line interface for translating SRT subtitle files from one language to another while preserving timecodes and subtitle structure. Uses the same translation engine as tnh-fab translate.
logger = get_child_logger(__name__)
module-attribute
SrtEntry
Represents a single subtitle entry from an SRT file.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 28–45.
end_time = end_time
instance-attribute
index = index
instance-attribute
line_key
property
Generate a unique line key for this entry.
start_time = start_time
instance-attribute
text = text.strip()
instance-attribute
__init__(index, start_time, end_time, text)
Initialize subtitle entry with timing and text.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 31–36.
__str__()
Format entry as SRT text.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 38–40.
SrtTranslator
Translates SRT files while preserving timecodes.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 48–171.
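The SRT format this class consumes is a sequence of blank-line-separated blocks: an index line, a timecode line, then one or more text lines. A minimal stand-alone parser sketch (the real code builds `SrtEntry` objects; this returns plain tuples and assumes well-formed input):

```python
import re
from typing import List, Tuple


def parse_srt(content: str) -> List[Tuple[int, str, str, str]]:
    """Parse SRT content into (index, start, end, text) tuples."""
    entries = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        index = int(lines[0])
        start, end = (t.strip() for t in lines[1].split("-->"))
        entries.append((index, start, end, "\n".join(lines[2:]).strip()))
    return entries
```

Because the timecode line is carried through untouched, a translator can replace only the text field and re-emit the blocks, which is exactly the preservation property documented above.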
metadata = metadata
instance-attribute
model = model
instance-attribute
pattern = pattern
instance-attribute
source_language = source_language
instance-attribute
target_language = target_language
instance-attribute
__init__(source_language=None, target_language='en', pattern=None, model=None, metadata=None)
Initialize translator with language, model settings, and metadata.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 51–62.
create_text_object(text)
Create a TextObject from the extracted SRT text with metadata.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 87–91.
entries_to_numbered_text(entries)
Convert SRT entries to numbered text for TextObject.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 81–85.
extract_translated_lines(translated_object)
Extract translated lines from TextObject with line keys.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 106–128.
format_srt(entries)
Format entries back to SRT content.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 143–145.
parse_srt(content)
Parse SRT content into structured entries.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 64–79.
translate_and_save(input_file, output_path)
Handles file reading, translation, and saving.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 162–171.
translate_srt(content)
Process SRT content through complete translation pipeline.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 147–160.
translate_text_object(text_object)
Translate the TextObject using line translation.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 93–104.
update_entries_with_translations(entries, translations)
Apply translations to original entries.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 130–141.
load_metadata_from_file(metadata_file)
Load metadata from a file if provided.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 189–203.
main()
Entry point for the srt-translate CLI tool.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 262–264.
set_output_path(input_file, output, target_language)
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 183–187.
set_pattern(pattern)
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 173–181.
srt_translate(input_file, output=None, source_language=None, target_language='en', model=None, pattern=None, debug=False, metadata=None)
Translate SRT subtitle files from one language to another.
INPUT_FILE is the path to the SRT file to translate.
Source code in `src/tnh_scholar/cli_tools/srt_translate/srt_translate.py`, lines 205–259.
tnh_fab
tnh_fab
TNH-FAB Command Line Interface
Part of the THICH NHAT HANH SCHOLAR (TNH_SCHOLAR) project. A rapid prototype implementation of the TNH-FAB command-line tool for OpenAI-based text processing. Provides core functionality for text punctuation, sectioning, translation, and general processing.
DEFAULT_SECTION_PATTERN = 'default_section'
module-attribute
DEFAULT_TRANSLATE_PATTERN = 'default_line_translate'
module-attribute
logger = get_child_logger(__name__)
module-attribute
pass_config = click.make_pass_decorator(TNHFabConfig, ensure=True)
module-attribute
TNHFabConfig
Holds configuration for the TNH-FAB CLI tool.
Source code in `src/tnh_scholar/cli_tools/tnh_fab/tnh_fab.py`, lines 45–63.
debug = False
instance-attribute
pattern_manager = PromptCatalog(pattern_dir)
instance-attribute
quiet = False
instance-attribute
verbose = False
instance-attribute
__init__()
Source code in `src/tnh_scholar/cli_tools/tnh_fab/tnh_fab.py`, lines 48–63.
export_processed_sections(section_result, text_obj)
Source code in `src/tnh_scholar/cli_tools/tnh_fab/tnh_fab.py`, lines 458–464.
gen_text_input(ctx, input_file)
Read input from file or stdin.
Source code in `src/tnh_scholar/cli_tools/tnh_fab/tnh_fab.py`, lines 67–73.
get_pattern(pattern_manager, pattern_name)
Get pattern from the pattern manager.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pattern_manager` | `PromptCatalog` | Initialized `PromptCatalog` instance | *required* |
| `pattern_name` | `str` | Name of the pattern to load | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `Pattern` | `Prompt` | Loaded pattern object |

Raises:

| Type | Description |
|---|---|
| `ClickException` | If pattern cannot be loaded |

Source code in `src/tnh_scholar/cli_tools/tnh_fab/tnh_fab.py`, lines 75–96.
main()
Entry point for TNH-FAB CLI tool.
Source code in `src/tnh_scholar/cli_tools/tnh_fab/tnh_fab.py`, lines 466–468.
process(config, input_file, pattern, section, auto, paragraph, template)
Apply custom pattern-based processing to text with flexible structuring options.
This command provides flexible text processing using customizable patterns. It can process text either by sections (defined in a JSON file or auto-detected), by paragraphs, or can be used to process a text as a whole (this is the default). This is particularly useful for formatting, restructuring, or applying consistent transformations to text.
Examples:
# Process using a specific pattern
$ tnh-fab process -p format_xml input.txt
# Process using paragraph mode
$ tnh-fab process -p format_xml -g input.txt
# Process with custom sections
$ tnh-fab process -p format_xml -s sections.json input.txt
# Process with template values
$ tnh-fab process -p format_xml -t template.yaml input.txt
Processing Modes:

1. Single Input Mode (default):
   - Processes the entire input as one unit.
2. Section Mode (-s):
   - Uses sections from a JSON file
   - Processes each section according to pattern
3. Paragraph Mode (-g):
   - Treats each line/paragraph as a separate unit
   - Useful for simpler processing tasks
   - More memory efficient for large files
4. Auto Section Mode (-a):
   - Automatically sections the input file
   - Processes by section

Notes:
- Required pattern must exist in pattern directory
- Template values can customize pattern behavior

Source code in `src/tnh_scholar/cli_tools/tnh_fab/tnh_fab.py`, lines 336–456.
punctuate(input_file, language, style, review_count, pattern)
[DEPRECATED] Punctuation command is deprecated.
Source code in `src/tnh_scholar/cli_tools/tnh_fab/tnh_fab.py`, lines 137–160.
section(config, input_file, language, num_sections, review_count, pattern)
Analyze and divide text into logical sections based on content.
This command processes the input text to identify coherent sections based on content analysis. It generates a structured representation of the text with sections that maintain logical continuity. Each section includes metadata such as title and line range.
Examples:
# Auto-detect sections in a file
$ tnh-fab section input.txt
# Specify desired number of sections
$ tnh-fab section -n 5 input.txt
# Process Vietnamese text with custom pattern
$ tnh-fab section -l vi -p custom_section_pattern input.txt
# Section text from stdin with increased review
$ cat input.txt | tnh-fab section -c 5
Output Format: JSON object containing:
- language: Detected or specified language code
- sections: Array of section objects, each with:
  - title: Section title in original language
  - start_line: Starting line number (inclusive)
  - end_line: Ending line number (inclusive)

Source code in `src/tnh_scholar/cli_tools/tnh_fab/tnh_fab.py`, lines 163–245.
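A concrete instance of the output shape described above (the titles and line ranges here are hypothetical sample values, not real output):

```python
import json

# Hypothetical result of `tnh-fab section` on a Vietnamese Dharma talk transcript.
section_result = {
    "language": "vi",  # detected or specified language code
    "sections": [
        {"title": "Mở đầu", "start_line": 1, "end_line": 40},
        {"title": "Thiền tập", "start_line": 41, "end_line": 120},
    ],
}

print(json.dumps(section_result, ensure_ascii=False, indent=2))
```

Note that `start_line`/`end_line` are both inclusive, so adjacent sections meet without overlapping.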
tnh_fab(ctx, verbose, debug, quiet)
TNH-FAB: Thich Nhat Hanh Scholar Text processing command-line tool.
CORE COMMANDS: punctuate, section, translate, process
To get help on any command and see its options:
tnh-fab [COMMAND] --help
Provides specialized processing for multi-lingual Dharma content. Offers functionality for punctuation, sectioning, line-based translation, and general text processing based on predefined patterns. Input text can be provided either via a file or standard input.
Source code in `src/tnh_scholar/cli_tools/tnh_fab/tnh_fab.py`, lines 99–134.
translate(config, input_file, language, target, style, context_lines, segment_size, pattern)
Translate text while preserving line numbers and contextual understanding.
This command performs intelligent translation that maintains line number correspondence between source and translated text. It uses surrounding context to improve translation accuracy and consistency, particularly important for texts where terminology and context are crucial.
Examples:
# Translate Vietnamese text to English
$ tnh-fab translate -l vi input.txt
# Translate to French with specific style
$ tnh-fab translate -l vi -r fr -y "Formal" input.txt
# Translate with increased context
$ tnh-fab translate --context-lines 5 input.txt
# Translate using custom segment size
$ tnh-fab translate --segment-size 10 input.txt
Notes:
- Line numbers are preserved in the output
- Context lines are used to improve translation accuracy
- Segment size affects processing speed and memory usage

Source code in `src/tnh_scholar/cli_tools/tnh_fab/tnh_fab.py`, lines 247–334.
tnh_setup
tnh_setup
OPENAI_ENV_HELP_MSG = "\n>>>>>>>>>> OpenAI API key not found in environment. <<<<<<<<<\n\nFor AI processing with TNH-scholar:\n\n1. Get an API key from https://platform.openai.com/api-keys\n2. Set the OPENAI_API_KEY environment variable:\n\n export OPENAI_API_KEY='your-api-key-here' # Linux/Mac\n set OPENAI_API_KEY=your-api-key-here # Windows\n\nFor OpenAI API access help: https://platform.openai.com/\n\n>>>>>>>>>>>>>>>>>>>>>>>>>>> -- <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<\n"
module-attribute
PATTERNS_URL = 'https://github.com/aaronksolomon/patterns/archive/main.zip'
module-attribute
create_config_dirs()
Create required configuration directories.
Source code in `src/tnh_scholar/cli_tools/tnh_setup/tnh_setup.py`, lines 39–43.
download_patterns()
Download and extract pattern files from GitHub.
Source code in `src/tnh_scholar/cli_tools/tnh_setup/tnh_setup.py`, lines 45–67.
main()
Entry point for setup CLI tool.
Source code in `src/tnh_scholar/cli_tools/tnh_setup/tnh_setup.py`, lines 97–99.
tnh_setup(skip_env, skip_patterns)
Set up TNH Scholar configuration.
Source code in `src/tnh_scholar/cli_tools/tnh_setup/tnh_setup.py`, lines 69–95.
tnh_tree
Developer tool for the tnh-scholar project.
This script generates a directory tree for the entire project and for the src directory, saving the results to 'project_directory_tree.txt' and 'src_directory_tree.txt' respectively.
Uses the generic generate_tree module, whose build_tree function performs the actual tree construction.
Exposed as a script via pyproject.toml under the name 'tnh-tree'.
main()
CLI entry point registered as tnh-tree.
Source code in `src/tnh_scholar/cli_tools/tnh_tree.py`, lines 15–17.
token_count
token_count
main()
Entry point for the token-count CLI tool.
Source code in `src/tnh_scholar/cli_tools/token_count/token_count.py`, lines 15–17.
token_count_cli(input_file)
Return the OpenAI API token count of a text file, based on gpt-4o.
Source code in `src/tnh_scholar/cli_tools/token_count/token_count.py`, lines 6–12.
ytt_fetch
__all__ = ['main', 'ytt_fetch']
module-attribute
main()
Source code in src/tnh_scholar/cli_tools/ytt_fetch/ytt_fetch.py
166 167 | |
ytt_fetch
Simple CLI tool for retrieving video transcripts.
This module provides a command line interface for downloading video transcripts in specified languages. It uses yt-dlp for video info extraction.
logger = get_child_logger(__name__)
module-attribute
cleanup_files(keep, filepath)
Source code in src/tnh_scholar/cli_tools/ytt_fetch/ytt_fetch.py
154 155 156 157 | |
export_data(output_path, data)
Source code in src/tnh_scholar/cli_tools/ytt_fetch/ytt_fetch.py
159 160 161 162 163 164 | |
export_ttml_data(metadata, ttml_path, no_embed, output_path, keep)
Source code in src/tnh_scholar/cli_tools/ytt_fetch/ytt_fetch.py
109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 | |
generate_metadata(dl, url, keep, output_path)
Source code in src/tnh_scholar/cli_tools/ytt_fetch/ytt_fetch.py
75 76 77 78 79 80 81 82 83 84 | |
generate_transcript(dl, url, lang, keep, no_embed, output_path)
Source code in src/tnh_scholar/cli_tools/ytt_fetch/ytt_fetch.py
86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 | |
get_ttml_download(dl, url, lang, output_path)
Source code in src/tnh_scholar/cli_tools/ytt_fetch/ytt_fetch.py, lines 139–152
main()
Source code in src/tnh_scholar/cli_tools/ytt_fetch/ytt_fetch.py, lines 166–167
ytt_fetch(url, lang, keep, info, no_embed, output)
YouTube Transcript Fetch: Retrieve and save transcripts for a YouTube video using yt-dlp.
Source code in src/tnh_scholar/cli_tools/ytt_fetch/ytt_fetch.py, lines 28–73
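Since the module delegates video info extraction to yt-dlp, a transcript-only options dictionary might look like the sketch below. The `build_subtitle_opts` helper is hypothetical; the option keys are standard yt-dlp options, but the exact set ytt-fetch uses is not shown in this reference:

```python
def build_subtitle_opts(lang: str, output_dir: str) -> dict:
    """yt-dlp options for fetching subtitles only, without downloading media."""
    return {
        "skip_download": True,        # transcripts only, no video/audio
        "writesubtitles": True,       # uploaded subtitles
        "writeautomaticsub": True,    # auto-generated captions as fallback
        "subtitleslangs": [lang],     # requested transcript language
        "subtitlesformat": "ttml",    # the format this module works with
        "outtmpl": f"{output_dir}/%(id)s.%(ext)s",
    }
```

Such a dict would typically be passed to `yt_dlp.YoutubeDL(opts)` before calling `extract_info(url)`.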
exceptions
__all__ = ['TnhScholarError', 'ConfigurationError', 'ValidationError', 'ExternalServiceError', 'RateLimitError', 'NotRetryable']
module-attribute
ConfigurationError
Bases: TnhScholarError
Configuration-related errors (missing env vars, invalid settings, etc.).
Source code in src/tnh_scholar/exceptions.py, lines 33–34
ExternalServiceError
Bases: TnhScholarError
Upstream/provider errors (HTTP 5xx, transport, transient provider issues).
Source code in src/tnh_scholar/exceptions.py, lines 41–42
NotRetryable
Bases: TnhScholarError
Marker for errors where retry is known to be pointless (e.g., bad auth).
Source code in src/tnh_scholar/exceptions.py, lines 49–50
RateLimitError
Bases: ExternalServiceError
Upstream rate limits; typically retryable after a backoff.
Source code in src/tnh_scholar/exceptions.py, lines 45–46
TnhScholarError
Bases: Exception
Base exception for all tnh_scholar errors.
Attributes:

| Name | Type | Description |
|---|---|---|
| message | | Human-readable summary. |
| context | | Optional structured context to aid logging/diagnostics. Keep this JSON-serializable. |
| cause | | Optional underlying exception. |

Source code in src/tnh_scholar/exceptions.py, lines 8–30
__cause__ = cause
instance-attribute
context = dict(context) if context else {}
instance-attribute
message = message
instance-attribute
__init__(message='', *, context=None, cause=None)
Source code in src/tnh_scholar/exceptions.py, lines 17–27
__str__()
Source code in src/tnh_scholar/exceptions.py, lines 29–30
ValidationError
Bases: TnhScholarError
Input/data validation errors (precondition failures before calling providers).
Source code in src/tnh_scholar/exceptions.py, lines 37–38
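To show how the hierarchy above is meant to be used, here is a self-contained sketch that mirrors the documented attributes (message, context, cause). It redefines minimal stand-ins rather than importing the real classes, so any detail beyond the documented interface is an assumption:

```python
class TnhScholarError(Exception):
    """Stand-in mirroring the documented base: message, context, cause."""
    def __init__(self, message="", *, context=None, cause=None):
        super().__init__(message)
        self.message = message
        self.context = dict(context) if context else {}
        self.__cause__ = cause

class ExternalServiceError(TnhScholarError): ...
class RateLimitError(ExternalServiceError): ...

# Catching at the base class covers the whole hierarchy:
try:
    raise RateLimitError("429 from provider", context={"retry_after": 20})
except TnhScholarError as exc:
    caught = exc
```

Because RateLimitError subclasses ExternalServiceError, retry logic can branch on the more specific type while logging handles everything at the base.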
journal_processing
journal_process
BATCH_RETRY_DELAY = 5
module-attribute
DEFAULT_JOURNAL_MODEL = 'gpt-4o'
module-attribute
DEFAULT_MODEL_SETTINGS = {'gpt-4o': {'max_tokens': 16000, 'temperature': 1.0}, 'gpt-3.5-turbo': {'max_tokens': 4096, 'temperature': 1.0}, 'gpt-4o-mini': {'max_tokens': 16000, 'temperature': 1.0}}
module-attribute
MAX_BATCH_RETRIES = 40
module-attribute
MAX_TOKEN_LIMIT = 60000
module-attribute
journal_schema = {'type': 'object', 'properties': {'journal_summary': {'type': 'string'}, 'sections': {'type': 'array', 'items': {'type': 'object', 'properties': {'title_vi': {'type': 'string'}, 'title_en': {'type': 'string'}, 'author': {'type': ['string', 'null']}, 'summary': {'type': 'string'}, 'keywords': {'type': 'array', 'items': {'type': 'string'}}, 'start_page': {'type': 'integer', 'minimum': 1}, 'end_page': {'type': 'integer', 'minimum': 1}}, 'required': ['title_vi', 'title_en', 'summary', 'keywords', 'start_page', 'end_page']}}}, 'required': ['journal_summary', 'sections']}
module-attribute
logger = logging.getLogger('journal_process')
module-attribute
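The journal_schema above drives validation of AI output. As a minimal illustration of how its `required` lists can be checked (a hand-rolled sketch using a trimmed copy of the schema, not the `jsonschema` package or the module's actual validator):

```python
# Trimmed copy of the top level of journal_schema, for illustration only.
schema = {
    "type": "object",
    "required": ["journal_summary", "sections"],
}

def missing_required(obj: dict, schema: dict) -> list:
    """Top-level required keys absent from an AI-generated result."""
    return [key for key in schema.get("required", []) if key not in obj]
```

A non-empty result signals that the AI output needs cleaning or regeneration before downstream processing.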
ModelSettings
Bases: TypedDict
Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 25–27
max_tokens
instance-attribute
temperature
instance-attribute
batch_section(input_xml_path, batch_jsonl, system_message, journal_name)
Splits the journal content into sections using GPT, with retries for both starting and completing the batch.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_xml_path | str | Path to the input XML file. | required |
| output_json_path | str | Path to save validated metadata JSON. | required |
| raw_output_path | str | Path to save the raw batch results. | required |
| journal_name | str | Name of the journal being processed. | required |
| max_retries | int | Maximum number of retries for batch processing. | required |
| retry_delay | int | Delay in seconds between retries. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | | The result of the batch sectioning process as a serialized JSON object. |

Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 565–624
batch_translate(input_xml_path, batch_json_path, metadata_path, system_message, journal_name)
Translates the journal sections using the GPT model. Saves the translated content back to XML.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_xml_path | str | Path to the input XML file. | required |
| metadata_path | str | Path to the metadata JSON file. | required |
| journal_name | str | Name of the journal. | required |
| xml_output_path | str | Path to save the translated XML. | required |
| max_retries | int | Maximum number of retries for batch operations. | required |
| retry_delay | int | Delay in seconds between retries. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| bool | | True if the process succeeds, False otherwise. |

Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 628–689
create_jsonl_file_for_batch(messages, output_file_path=None, max_token_list=None, model=DEFAULT_JOURNAL_MODEL, tools=None, json_mode=False)
Write a JSONL batch file mirroring the legacy OpenAI format.
Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 99–156
deserialize_json(serialized_data)
Converts a serialized JSON string into a Python dictionary.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| serialized_data | str | The JSON string to deserialize. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| dict | | The deserialized Python dictionary. |

Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 885–906
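A functionally equivalent sketch of deserialize_json, assuming only the documented behavior (parse a JSON string, return a dict); the specific error handling shown is illustrative:

```python
import json

def deserialize_json(serialized_data: str) -> dict:
    """Parse a JSON string into a Python dictionary."""
    try:
        data = json.loads(serialized_data)
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON: {e}") from e
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    return data
```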
extract_page_groups_from_metadata(metadata)
Extracts page groups from the section metadata for use with split_xml_pages.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metadata | dict | The section metadata containing sections with start and end pages. | required |

Returns:

| Type | Description |
|---|---|
| List[Tuple[int, int]] | A list of tuples, each representing a page range (start_page, end_page). |

Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 464–502
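Given the documented metadata shape (sections carrying start_page and end_page, as in journal_schema), the extraction reduces to a comprehension. This hypothetical version mirrors the documented return type:

```python
from typing import List, Tuple

def extract_page_groups(metadata: dict) -> List[Tuple[int, int]]:
    """(start_page, end_page) per section, ready for split_xml_pages."""
    return [
        (section["start_page"], section["end_page"])
        for section in metadata.get("sections", [])
    ]
```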
generate_all_batches(processed_document_dir, system_message, user_wrap_function, file_regex='.*\\.xml')
Generate cleaning batches for all journals in the specified directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| processed_journals_dir | str | Path to the directory containing processed journal data. | required |
| system_message | str | System message template for batch processing. | required |
| user_wrap_function | callable | Function to wrap user input for processing pages. | required |
| file_regex | str | Regex pattern to identify target files (default: ".*\.xml"). | '.*\\.xml' |

Returns:

| Type | Description |
|---|---|
| None | |

Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 966–1006
generate_clean_batch(input_xml_file, output_file, system_message, user_wrap_function)
Generate a batch file for the OpenAI (OA) API using a single input XML file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| batch_file | str | Full path to the input XML file to process. | required |
| output_file | str | Full path to the output batch JSONL file. | required |
| system_message | str | System message template for batch processing. | required |
| user_wrap_function | callable | Function to wrap user input for processing pages. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | | Path to the created batch file. |

Raises:

| Type | Description |
|---|---|
| Exception | If an error occurs during file processing. |

Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 509–561
generate_messages(system_message, user_message_wrapper, data_list_to_process, log_system_message=True)
Build OpenAI-style chat message payloads.
Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 39–56
generate_single_oa_batch_from_pages(input_xml_file, output_file, system_message, user_wrap_function)
*Deprecated.* Generate a batch file for the OpenAI (OA) API using a single input XML file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| batch_file | str | Full path to the input XML file to process. | required |
| output_file | str | Full path to the output batch JSONL file. | required |
| system_message | str | System message template for batch processing. | required |
| user_wrap_function | callable | Function to wrap user input for processing pages. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | | Path to the created batch file. |

Raises:

| Type | Description |
|---|---|
| Exception | If an error occurs during file processing. |

Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 911–963
run_immediate_chat_process(messages, max_tokens=0, response_format=None, model=DEFAULT_JOURNAL_MODEL)
Legacy-compatible immediate completion powered by GenAI simple_completion.
Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 80–96
save_cleaned_data(cleaned_xml_path, cleaned_wrapped_pages, journal_name)
Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 812–828
save_sectioning_data(output_json_path, raw_output_path, serial_json, journal_name)
Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 831–863
save_translation_data(xml_output_path, translation_data, journal_name)
Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 866–881
send_data_for_tx_batch(batch_jsonl_path, section_data_to_send, system_message, max_token_list, journal_name, immediate=False)
Sends data for translation batch or immediate processing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| batch_jsonl_path | Path | Path for the JSONL file to save batch data. | required |
| section_data_to_send | List | List of section data to translate. | required |
| system_message | str | System message for the translation process. | required |
| max_token_list | List | List of max tokens for each section. | required |
| journal_name | str | Name of the journal being processed. | required |
| immediate | bool | If True, run immediate chat processing instead of batch. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| List | | Translated data from the batch or immediate process. |

Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 744–808
setup_logger(log_file_path)
Configures the logger to write to a log file and the console. Adds a custom "PRIORITY_INFO" logging level for important messages.
Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 219–253
start_batch_with_retries(jsonl_file, description='', max_retries=MAX_BATCH_RETRIES, retry_delay=BATCH_RETRY_DELAY, poll_interval=10, timeout=3600)
Simulate the legacy batch runner using sequential simple_completion calls.
The parameters mirror the old interface so callers remain unchanged, but the implementation now iterates through the JSONL requests locally.
Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 159–215
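The described pattern, iterating through JSONL requests locally with per-request retries, can be sketched as follows. The `run_jsonl_requests` helper and the `complete` callback are hypothetical stand-ins for the real runner and simple_completion; the `body.messages` layout follows the legacy OpenAI batch JSONL format:

```python
import json
import time

def run_jsonl_requests(jsonl_path, complete, max_retries=3, retry_delay=1):
    """Sequentially execute JSONL batch requests, retrying each on failure."""
    with open(jsonl_path) as f:
        requests = [json.loads(line) for line in f if line.strip()]
    results = []
    for req in requests:
        messages = req["body"]["messages"]  # legacy batch-file layout
        for attempt in range(max_retries):
            try:
                results.append(complete(messages))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # exhausted retries for this request
                time.sleep(retry_delay)
    return results
```

Keeping the old signature while looping locally is what lets existing callers remain unchanged.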
translate_sections(batch_jsonl_path, system_message, section_contents, section_metadata, journal_name, immediate=False)
Build up sections in batches for translation.
Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 692–741
unwrap_all_lines(pages)
Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 335–342
unwrap_lines(text)
Removes angle brackets (< >) from encapsulated lines and merges them into a newline-separated string.

Parameters:
    text (str): The input string with encapsulated lines.

Returns:
    str: A newline-separated string with the encapsulation removed.

Example:
    >>> merge_encapsulated_lines("<Line 1> <Line 2> <Line 3>")
    'Line 1\nLine 2\nLine 3'

Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 312–332
validate_and_clean_data(data, schema)
Recursively validate and clean AI-generated data to fit the given schema. Any missing fields are filled with defaults, and extra fields are ignored.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | dict | The AI-generated data to validate and clean. | required |
| schema | dict | The schema defining the required structure. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| dict | | The cleaned data adhering to the schema. |

Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 346–430
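A recursive sketch of the documented behavior (fill missing required fields with defaults, ignore extra fields). The `clean_to_schema` name and its default values are illustrative assumptions, not the library's actual rules:

```python
def clean_to_schema(data, schema):
    """Keep schema'd fields, default missing required ones, drop extras."""
    defaults = {"string": "", "integer": 0, "array": [], "object": {}}
    kind = schema.get("type")
    if kind == "object" and isinstance(data, dict):
        out = {}
        for key, sub in schema.get("properties", {}).items():
            if key in data:
                out[key] = clean_to_schema(data[key], sub)
            elif key in schema.get("required", []):
                # Fill a missing required field with a type-appropriate default.
                sub_type = sub.get("type")
                out[key] = defaults.get(sub_type) if isinstance(sub_type, str) else None
        return out
    if kind == "array" and isinstance(data, list):
        return [clean_to_schema(item, schema.get("items", {})) for item in data]
    return data  # leaf values pass through unchanged
```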
validate_and_save_metadata(output_file_path, json_metadata_serial, schema)
Validates and cleans journal data against the schema, then writes it to a JSON file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | str | The journal data as a serialized JSON string to validate and clean. | required |
| schema | dict | The schema defining the required structure. | required |
| output_file_path | str | Path to the output JSON file. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| bool | | True if successfully written to the file, False otherwise. |

Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 433–461
wrap_all_lines(pages)
Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 308–309
wrap_lines(text)
Encloses each line of the input text with angle brackets.

Args:
    text (str): The input string containing lines separated by '\n'.

Returns:
    str: A string where each line is enclosed in angle brackets.

Example:
    >>> enclose_lines("This is a string with\ntwo lines.")
    '<This is a string with>\n<two lines.>'

Source code in src/tnh_scholar/journal_processing/journal_process.py, lines 291–305
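Based on the two docstrings above, minimal implementations might look like this sketch (the regex-based unwrap follows the docstring example, where several <...> groups can share one physical line; the real functions may differ in edge cases):

```python
import re

def wrap_lines(text: str) -> str:
    """Enclose each newline-separated line in angle brackets."""
    return "\n".join(f"<{line}>" for line in text.split("\n"))

def unwrap_lines(text: str) -> str:
    """Collect every <...> group and rejoin the contents with newlines."""
    return "\n".join(re.findall(r"<([^>]*)>", text))
```

The wrapping marks line boundaries so they survive a round trip through a model that may reflow text.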
logging_config
TNH-Scholar Logging Utilities
A production-ready, environment-driven logging system for the TNH-Scholar project. It provides JSON logs in production, color/plain text in development, optional non-blocking queue logging, file rotation, noise suppression for chatty deps, and optional routing of Python warnings into the logging pipeline.
This module is designed for application layer configuration and library layer usage:
- Applications (CLI, Streamlit, FastAPI, notebooks) call setup_logging().
- Libraries / services (e.g., gen_ai_service, IssueHandler) only acquire a logger via get_logger() (or legacy get_child_logger()) and never configure global logging.
Quick start
Application entry point (recommended):
>>> from tnh_scholar.logging_config import setup_logging, get_logger
>>> setup_logging() # reads env; see variables below
>>> log = get_logger(__name__)
>>> log.info("app started", extra={"service": "gen-ai"})
Jupyter / dev (force color in non-TTY):
>>> import os
>>> os.environ["APP_ENV"] = "dev"
>>> os.environ["LOG_JSON"] = "false"
>>> os.environ["LOG_COLOR"] = "true"  # Jupyter isn't a TTY; force color
>>> from tnh_scholar.logging_config import setup_logging, get_logger
>>> setup_logging()
>>> get_logger(__name__).info("hello, color")
Library / service modules (do NOT configure logging):
>>> from tnh_scholar.logging_config import get_logger
>>> log = get_logger(__name__)
>>> log.info("library message")
Behavior by environment
- dev (default):
- Plain or color text to stdout by default.
- Queue logging disabled by default (synchronous).
- Color auto-detects TTY and Jupyter/IPython (can be forced).
- prod:
- JSON logs to stderr by default (suitable for log shippers).
- Queue logging enabled by default (can be disabled).
Environment variables
Most behavior is controlled by environment variables, read when setup_logging() instantiates LogSettings. Truthy values accept true/1/yes/on (case-insensitive).

- APP_ENV: dev|prod|test (default: dev)
- LOG_LEVEL: Logging level for the base project logger (default: INFO)
- LOG_STDOUT: Emit logs to stdout (default: true)
- LOG_FILE_ENABLE: Emit logs to a file (default: false)
- LOG_FILE_PATH: File path for logs (default: ./logs/main.log)
- LOG_ROTATE_BYTES: Rotate at N bytes (e.g., 10485760) (default: unset)
- LOG_ROTATE_WHEN: Timed rotation (e.g., midnight) (default: unset)
- LOG_BACKUPS: Number of rotated file backups (default: 5)
- LOG_JSON: Use JSON formatter (recommended in prod) (default: true)
- LOG_COLOR: true|false|auto (default: auto)
- LOG_STREAM: stdout|stderr (default: stderr; dev defaults to stdout)
- LOG_USE_QUEUE: Use QueueHandler/QueueListener (default: true; dev defaults to false)
- LOG_CAPTURE_WARNINGS: Route Python warnings via logging (default: false)
- LOG_SUPPRESS: Comma-separated list of noisy module names to set to WARNING (default includes urllib3, httpx, openai, uvicorn.*, etc.)
Backward compatibility
- get_child_logger(name, console=False, separate_file=False) remains available and can attach ad-hoc console/file handlers without reconfiguring the project base logger. When custom handlers are attached, the child's propagation is turned off to avoid duplicate messages.
- setup_logging_legacy(...) forwards to setup_logging() and emits a DeprecationWarning to help locate legacy call sites.
Custom level
PRIORITY_INFO (25) and logger.priority_info still exist but are deprecated. Prefer: log.info("message", extra={"priority": "high"})
This keeps level semantics standard and plays better with structured logging.
Queue logging notes
- When LOG_USE_QUEUE=true, the base logger uses a QueueHandler. A QueueListener is started with sinks mirroring your configured stdout/file handlers. This decouples log emission from I/O to minimize latency.
- In notebooks or during debugging, you may prefer synchronous logs: os.environ["LOG_USE_QUEUE"] = "false"
Python warnings routing
- When LOG_CAPTURE_WARNINGS=true, Python warnings are captured and logged through py.warnings. This module attaches the base logger's handlers to that logger and disables propagation to avoid duplicate output.
Mixing print() and logging
print() writes to stdout; the logger can write to stdout or stderr depending on LOG_STREAM and environment. Ordering is not guaranteed, especially with queue logging enabled. Prefer logging for consistent output.
Minimal examples
CLI / entrypoint:
>>> import os
>>> os.environ.setdefault("APP_ENV", "prod")
>>> os.environ.setdefault("LOG_JSON", "true")
>>> from tnh_scholar.logging_config import setup_logging, get_logger
>>> setup_logging()
>>> get_logger(__name__).info("ready")
File logging with rotation:
>>> import os
>>> os.environ.update({
... "LOG_FILE_ENABLE": "true",
... "LOG_FILE_PATH": "./logs/app.log",
... "LOG_ROTATE_BYTES": "10485760", # 10MB
... "LOG_BACKUPS": "7",
... })
>>> setup_logging()
>>> get_logger("smoke").info("to file")
Jupyter with color:
>>> import os
>>> os.environ.update({"APP_ENV": "dev", "LOG_JSON": "false", "LOG_COLOR": "true"})
>>> setup_logging()
>>> get_logger(__name__).info("color in notebook")
Notes
- JSON formatting requires python-json-logger; without it, we fall back to plain/color format automatically.
- This module never configures the root logger; it configures the project base logger (tnh) so your app can coexist with other libraries cleanly.
BASE_LOG_DIR = Path('./logs')
module-attribute
BASE_LOG_NAME = 'tnh'
module-attribute
DEFAULT_CONSOLE_FORMAT_STRING = LOG_FMT_COLOR
module-attribute
DEFAULT_FILE_FORMAT_STRING = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
module-attribute
DEFAULT_LOG_FILEPATH = Path('main.log')
module-attribute
JsonFormatter = getattr(_pythonjsonlogger_json, 'JsonFormatter', None)
module-attribute
LOG_COLORS = {'DEBUG': 'bold_green', 'INFO': 'cyan', 'PRIORITY_INFO': 'bold_cyan', 'WARNING': 'bold_yellow', 'ERROR': 'bold_red', 'CRITICAL': 'bold_red'}
module-attribute
LOG_FMT_COLOR = '%(asctime)s | %(log_color)s%(levelname)-8s%(reset)s | %(name)s | %(message)s'
module-attribute
LOG_FMT_JSON = '%(asctime)s %(levelname)s %(name)s %(message)s %(process)d %(thread)d %(module)s %(filename)s %(lineno)d'
module-attribute
LOG_FMT_PLAIN = '%(asctime)s | %(levelname)-8s | %(name)s | %(message)s'
module-attribute
MAX_FILE_SIZE = 10 * 1024 * 1024
module-attribute
PRIORITY_INFO_LEVEL = 25
module-attribute
__all__ = ['BASE_LOG_NAME', 'BASE_LOG_DIR', 'DEFAULT_LOG_FILEPATH', 'MAX_FILE_SIZE', 'OMPFilter', 'setup_logging', 'setup_logging_legacy', 'get_logger', 'get_child_logger']
module-attribute
LogSettings
dataclass
Environment-driven logging settings with sensible defaults.
Source code in src/tnh_scholar/logging_config.py, lines 277–349
backups = field(default_factory=(lambda: _env_int('LOG_BACKUPS', 5)))
class-attribute
instance-attribute
base_name = field(default_factory=(lambda: _env_str('LOG_BASE', BASE_LOG_NAME)))
class-attribute
instance-attribute
capture_warnings = field(default_factory=(lambda: _env_bool('LOG_CAPTURE_WARNINGS', 'false')))
class-attribute
instance-attribute
colorize = field(default_factory=(lambda: _env_str('LOG_COLOR', 'auto')))
class-attribute
instance-attribute
environment = field(default_factory=(lambda: _env_str('APP_ENV', 'dev')))
class-attribute
instance-attribute
file_path = field(default_factory=(lambda: Path(_env_str('LOG_FILE_PATH', str(BASE_LOG_DIR / DEFAULT_LOG_FILEPATH)))))
class-attribute
instance-attribute
json_format = field(default_factory=(lambda: _env_bool('LOG_JSON', 'true')))
class-attribute
instance-attribute
level = field(default_factory=(lambda: _env_str('LOG_LEVEL', 'INFO')))
class-attribute
instance-attribute
log_stream = field(default_factory=(lambda: _env_str('LOG_STREAM', 'stderr')))
class-attribute
instance-attribute
rotate_bytes = field(default_factory=(lambda: _env_int('LOG_ROTATE_BYTES', 0) or None))
class-attribute
instance-attribute
rotate_when = field(default_factory=(lambda: _env_str('LOG_ROTATE_WHEN', '') or None))
class-attribute
instance-attribute
suppress_modules = field(default_factory=(lambda: _env_str('LOG_SUPPRESS', 'urllib3,httpx,openai,botocore,boto3,asyncio,uvicorn,uvicorn.error,uvicorn.access')))
class-attribute
instance-attribute
to_file = field(default_factory=(lambda: _env_bool('LOG_FILE_ENABLE', 'false')))
class-attribute
instance-attribute
to_stdout = field(default_factory=(lambda: _env_bool('LOG_STDOUT', 'true')))
class-attribute
instance-attribute
use_queue = field(default_factory=(lambda: _env_bool('LOG_USE_QUEUE', 'true')))
class-attribute
instance-attribute
__init__(environment=(lambda: _env_str('APP_ENV', 'dev'))(), base_name=(lambda: _env_str('LOG_BASE', BASE_LOG_NAME))(), level=(lambda: _env_str('LOG_LEVEL', 'INFO'))(), to_stdout=(lambda: _env_bool('LOG_STDOUT', 'true'))(), to_file=(lambda: _env_bool('LOG_FILE_ENABLE', 'false'))(), file_path=(lambda: Path(_env_str('LOG_FILE_PATH', str(BASE_LOG_DIR / DEFAULT_LOG_FILEPATH))))(), rotate_when=(lambda: _env_str('LOG_ROTATE_WHEN', '') or None)(), rotate_bytes=(lambda: _env_int('LOG_ROTATE_BYTES', 0) or None)(), backups=(lambda: _env_int('LOG_BACKUPS', 5))(), json_format=(lambda: _env_bool('LOG_JSON', 'true'))(), colorize=(lambda: _env_str('LOG_COLOR', 'auto'))(), capture_warnings=(lambda: _env_bool('LOG_CAPTURE_WARNINGS', 'false'))(), log_stream=(lambda: _env_str('LOG_STREAM', 'stderr'))(), use_queue=(lambda: _env_bool('LOG_USE_QUEUE', 'true'))(), suppress_modules=(lambda: _env_str('LOG_SUPPRESS', 'urllib3,httpx,openai,botocore,boto3,asyncio,uvicorn,uvicorn.error,uvicorn.access'))())
__post_init__()
Source code in src/tnh_scholar/logging_config.py, lines 343–349
is_dev()
Source code in src/tnh_scholar/logging_config.py, lines 322–323
selected_stream()
Return the Python stream object to emit logs to (stdout or stderr).
Source code in src/tnh_scholar/logging_config.py, lines 339–341
should_color()
Source code in src/tnh_scholar/logging_config.py, lines 325–337
LoggingConfigurator
Source code in src/tnh_scholar/logging_config.py, lines 354–623
settings = settings or LogSettings()
instance-attribute
__init__(settings=None)
Source code in src/tnh_scholar/logging_config.py, lines 399–402
apply_config(config)
Source code in src/tnh_scholar/logging_config.py, lines 549–564
apply_legacy_args(*, log_level, log_filepath, max_log_file_size, backup_count, console)
Source code in src/tnh_scholar/logging_config.py, lines 405–424
build_config(*, filters, formatters, handlers)
Source code in src/tnh_scholar/logging_config.py, lines 532–547
build_filters()
Source code in src/tnh_scholar/logging_config.py, lines 449–450
build_formatters()
Source code in src/tnh_scholar/logging_config.py, lines 427–447
build_handlers(formatters)
Source code in src/tnh_scholar/logging_config.py, lines 452–479
configure(*, legacy_args, suppressed_modules)
Source code in src/tnh_scholar/logging_config.py, lines 609–623
select_base_handlers(handlers)
Source code in src/tnh_scholar/logging_config.py, lines 520–530
start_queue_listener(handlers)
Source code in src/tnh_scholar/logging_config.py, lines 566–587
suppress_noise(modules_override, force=False)
Source code in src/tnh_scholar/logging_config.py, lines 589–606
OMPFilter
Bases: Filter
Source code in src/tnh_scholar/logging_config.py, lines 654–657
filter(record)
Source code in src/tnh_scholar/logging_config.py, lines 655–657
UtcFormatter
Bases: Formatter
UTC ISO-8601 timestamps for plain text logging.
Source code in src/tnh_scholar/logging_config.py, lines 265–274
converter = time.gmtime
class-attribute
instance-attribute
formatTime(record, datefmt=None)
Source code in src/tnh_scholar/logging_config.py, lines 271–274
get_child_logger(name, console=False, separate_file=False)
Get a child logger that writes logs to a console or a specified file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the child logger (e.g., module name). | required |
| console | bool | If True, log to the console. If False, do not log to the console. If None, inherit console behavior from the parent logger. | False |
| file | Path | A string specifying a logfile to log to. | required |

Returns:

| Type | Description |
|---|---|
| logging.Logger | Configured child logger. |

Source code in src/tnh_scholar/logging_config.py, lines 660–710
get_logger(name)
Preferred helper: returns a namespaced logger under the base project name.
Backwards-compatible with existing call sites that used get_child_logger(name).
Source code in src/tnh_scholar/logging_config.py, lines 714–719
priority_info(self, message, *args, **kwargs)
Deprecated: use logger.info(msg, extra={"priority": "high"}) instead.
This custom level (25) was introduced for highlighting important informational
events, but it complicates interoperability with external log shippers and
structured log processing. The recommended migration path is to log at the
standard INFO level with an added extra field indicating priority.
Example:

```python
logger.info("Important event", extra={"priority": "high"})
```
Source code in src/tnh_scholar/logging_config.py, lines 195–217
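The recommended migration can be demonstrated end to end; the `extra` dict lands on the `LogRecord` as a plain attribute, which is what makes it friendly to log shippers:

```python
import logging

logger = logging.getLogger("tnh_scholar.demo")

# Deprecated: logger.priority_info("Important event")  (custom level 25)
# Recommended: standard INFO with a structured "priority" field.
logger.info("Important event", extra={"priority": "high"})

# The extra field is attached to the LogRecord, where handlers,
# formatters, or structured-log processors can read it back:
record = logger.makeRecord(logger.name, logging.INFO, __file__, 1,
                           "Important event", None, None,
                           extra={"priority": "high"})
print(record.priority)
```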
setup_logging(log_level=logging.INFO, log_filepath=DEFAULT_LOG_FILEPATH, max_log_file_size=MAX_FILE_SIZE, backup_count=5, console=True, suppressed_modules=None, *, settings=None)
Initialize project-wide logging using dictConfig, with JSON in prod and colorized/plain text in dev.
Backward compatible with previous signature. Prefer using env vars or pass a LogSettings via the
keyword-only settings parameter.
Source code in src/tnh_scholar/logging_config.py, lines 626–651
setup_logging_legacy(*args, **kwargs)
Deprecated: use setup_logging().
This wrapper preserves old call sites during migration. It emits a DeprecationWarning (once per process) and forwards all arguments to the current setup_logging().
Source code in src/tnh_scholar/logging_config.py, lines 722–733
metadata
JsonValue = Union[str, int, float, bool, list, dict, None]
module-attribute
logger = get_child_logger(__name__)
module-attribute
Frontmatter
Handles YAML frontmatter embedding and extraction.
Source code in src/tnh_scholar/metadata/metadata.py, lines 239–293
embed(metadata, content)
classmethod
Embed metadata as YAML frontmatter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metadata` | `Metadata` | Dictionary of metadata. | required |
| `content` | `str` | Content text. | required |

Returns:

| Type | Description |
|---|---|
| `str` | Text with embedded frontmatter. |

Source code in src/tnh_scholar/metadata/metadata.py, lines 266–282
extract(content)
staticmethod
Extract frontmatter and content from text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `content` | `str` | Text with optional YAML frontmatter. | required |

Returns:

| Type | Description |
|---|---|
| `tuple[Metadata, str]` | Tuple of (metadata object, remaining content). |

Source code in src/tnh_scholar/metadata/metadata.py, lines 241–259
extract_from_file(file)
classmethod
Source code in src/tnh_scholar/metadata/metadata.py, lines 261–264
generate(metadata)
staticmethod
Source code in src/tnh_scholar/metadata/metadata.py, lines 284–293
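The frontmatter convention can be illustrated with a simplified, stdlib-only sketch. This is not the package's implementation: it handles flat string values only, whereas the real class parses full YAML via a YAML library.

```python
import re

def embed(metadata, content):
    # Prepend a YAML-style frontmatter block (simplified: flat string values).
    lines = "".join(f"{k}: {v}\n" for k, v in metadata.items())
    return f"---\n{lines}---\n{content}"

def extract(text):
    # Split frontmatter from content; return ({}, text) when none is present.
    m = re.match(r"^---\n(.*?)\n---\n(.*)\Z", text, re.DOTALL)
    if not m:
        return {}, text
    meta = dict(line.split(": ", 1) for line in m.group(1).splitlines())
    return meta, m.group(2)

doc = embed({"author": "Thich Nhat Hanh"}, "Breathe, you are alive.")
meta, body = extract(doc)
print(meta, "|", body)
```

The round trip (embed, then extract) recovers both the metadata mapping and the untouched content.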
Metadata
Bases: MutableMapping
Flexible metadata container that behaves like a dict while ensuring JSON serializability. Designed for AI processing pipelines where schema flexibility is prioritized over structure.
Source code in src/tnh_scholar/metadata/metadata.py, lines 38–217
process_history
property
Access process history with proper typing.
__delitem__(key)
Source code in src/tnh_scholar/metadata/metadata.py, lines 81–82
__get_pydantic_core_schema__(source_type, handler)
classmethod
Defines the Pydantic core schema for the Metadata class.
This method allows Pydantic to validate Metadata objects as dictionaries.
It handles both direct Metadata instances and dictionaries during validation,
providing flexibility for data input.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source_type` | `Any` | The source type being validated. | required |
| `handler` | `Callable[[Any], CoreSchema]` | A callable to handle schema generation for other types. | required |

Returns:

| Type | Description |
|---|---|
| `CoreSchema` | A Pydantic core schema that validates either a Metadata instance (by converting it to a dictionary) or a standard dictionary. |

Source code in src/tnh_scholar/metadata/metadata.py, lines 114–146
__getitem__(key)
Source code in src/tnh_scholar/metadata/metadata.py, lines 74–75
__init__(data=None)
Source code in src/tnh_scholar/metadata/metadata.py, lines 50–60
__ior__(other)
Source code in src/tnh_scholar/metadata/metadata.py, lines 105–109
__iter__()
Source code in src/tnh_scholar/metadata/metadata.py, lines 84–85
__len__()
Source code in src/tnh_scholar/metadata/metadata.py, lines 87–88
__or__(other)
Source code in src/tnh_scholar/metadata/metadata.py, lines 94–98
__repr__()
Source code in src/tnh_scholar/metadata/metadata.py, lines 111–112
__ror__(other)
Source code in src/tnh_scholar/metadata/metadata.py, lines 100–103
__setitem__(key, value)
Process and set value, ensuring JSON serializability.
Source code in src/tnh_scholar/metadata/metadata.py, lines 77–79
__str__()
Source code in src/tnh_scholar/metadata/metadata.py, lines 90–91
add_process_info(process_metadata)
Add process metadata to history.
Source code in src/tnh_scholar/metadata/metadata.py, lines 198–204
copy()
Create a deep copy of the metadata object.
Source code in src/tnh_scholar/metadata/metadata.py, lines 158–160
from_dict(data)
classmethod
Create from a plain dict.
Source code in src/tnh_scholar/metadata/metadata.py, lines 153–156
from_fields(data, fields)
classmethod
Create a Metadata object by extracting specified fields from a dictionary.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `dict` | Source dictionary. | required |
| `fields` | `list[str]` | List of field names to extract. | required |

Returns:

| Type | Description |
|---|---|
| `Metadata` | New Metadata instance with only the specified fields. |

Source code in src/tnh_scholar/metadata/metadata.py, lines 162–174
from_yaml(yaml_str)
classmethod
Create Metadata instance from YAML string.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `yaml_str` | `str` | YAML formatted string. | required |

Returns:

| Type | Description |
|---|---|
| `Metadata` | New Metadata instance. |

Raises:

| Type | Description |
|---|---|
| `YAMLError` | If YAML parsing fails. |

Source code in src/tnh_scholar/metadata/metadata.py, lines 176–193
text_embed(content)
Source code in src/tnh_scholar/metadata/metadata.py, lines 195–196
to_dict()
Convert to plain dict for JSON serialization.
Source code in src/tnh_scholar/metadata/metadata.py, lines 149–151
to_yaml()
Return metadata as a YAML-formatted string.
Source code in src/tnh_scholar/metadata/metadata.py, lines 211–217
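The `Metadata` contract described above (a `MutableMapping` that guarantees JSON serializability on write) can be sketched as a hypothetical re-implementation; this illustrates the design, not the package's actual code:

```python
import json
from collections.abc import MutableMapping

class Metadata(MutableMapping):
    # Dict-like container that rejects values which cannot be
    # serialized to JSON (illustrative sketch only).
    def __init__(self, data=None):
        self._data = {}
        for key, value in (data or {}).items():
            self[key] = value

    def __setitem__(self, key, value):
        json.dumps(value)  # raises TypeError for non-JSON-serializable values
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def to_dict(self):
        # Plain dict for JSON serialization.
        return dict(self._data)

m = Metadata({"title": "The Miracle of Mindfulness"})
m["tags"] = ["mindfulness", "practice"]
print(json.dumps(m.to_dict(), sort_keys=True))
```

Validating at write time means a pipeline fails fast at the step that produced a bad value, rather than at final serialization.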
ProcessMetadata
Bases: Metadata
Records information about a specific processing operation.
Source code in src/tnh_scholar/metadata/metadata.py, lines 219–237
__init__(step, processor, tool=None, **additional_params)
Source code in src/tnh_scholar/metadata/metadata.py, lines 221–237
safe_yaml_load(yaml_str, *, context='unknown')
Source code in src/tnh_scholar/metadata/metadata.py, lines 22–36
ocr_processing
DEFAULT_ANNOTATION_FONT_PATH = Path('/System/Library/Fonts/Supplemental/Arial.ttf')
module-attribute
DEFAULT_ANNOTATION_FONT_SIZE = 12
module-attribute
DEFAULT_ANNOTATION_LANGUAGE_HINTS = ['vi']
module-attribute
DEFAULT_ANNOTATION_METHOD = 'DOCUMENT_TEXT_DETECTION'
module-attribute
DEFAULT_ANNOTATION_OFFSET = 2
module-attribute
logger = logging.getLogger('ocr_processing')
module-attribute
PDFParseWarning
Bases: Warning
Custom warning class for PDF parsing issues. Encapsulates minimal logic for displaying warnings with a custom format.
Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 24–39
warn(message)
staticmethod
Display a warning message with custom formatting.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `message` | `str` | The warning message to display. | required |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 30–39
annotate_image_with_text(image, text_annotations, annotation_font_path, font_size=12)
Annotates a PIL image with bounding boxes and text descriptions from OCR results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `image` | `Image` | The input PIL image to annotate. | required |
| `text_annotations` | `List[EntityAnnotation]` | OCR results containing bounding boxes and text. | required |
| `annotation_font_path` | `str` | Path to the font file for text annotations. | required |
| `font_size` | `int` | Font size for text annotations. | `12` |

Returns:

| Type | Description |
|---|---|
| `Image.Image` | The annotated PIL image. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the input image is None. |
| `IOError` | If the font file cannot be loaded. |
| `Exception` | For any other unexpected errors. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 240–297
build_processed_pdf(pdf_path, client, preprocessor=None, annotation_font_path=DEFAULT_ANNOTATION_FONT_PATH)
Processes a PDF document, extracting text, word locations, annotated images, and unannotated images.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pdf_path` | `Path` | Path to the PDF file. | required |
| `client` | `ImageAnnotatorClient` | Google Vision API client for text detection. | required |
| `preprocessor` | `Callable` | Optional per-page image preprocessing function. | `None` |
| `annotation_font_path` | `Path` | Path to the font file for annotations. | `DEFAULT_ANNOTATION_FONT_PATH` |

Returns:

| Type | Description |
|---|---|
| `Tuple[List[str], List[List[vision.EntityAnnotation]], List[Image.Image], List[Image.Image]]` | Extracted full-page texts (one entry per page), per-page word locations, annotated images, and unannotated images. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the specified PDF file does not exist. |
| `ValueError` | If the PDF file is invalid or contains no pages. |
| `Exception` | For any unexpected errors during processing. |

Example:

```python
from pathlib import Path
from google.cloud import vision

pdf_path = Path("/path/to/example.pdf")
font_path = Path("/path/to/fonts/Arial.ttf")
client = vision.ImageAnnotatorClient()
try:
    text_pages, word_locations_list, annotated_images, unannotated_images = build_processed_pdf(
        pdf_path, client, annotation_font_path=font_path
    )
    print(f"Processed {len(text_pages)} pages successfully!")
except Exception as e:
    print(f"Error processing PDF: {e}")
```

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 457–562
deserialize_entity_annotations_from_json(data)
Deserializes JSON data into a nested list of EntityAnnotation objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `str` | The JSON string containing serialized annotations. | required |

Returns:

| Type | Description |
|---|---|
| `List[List[EntityAnnotation]]` | The reconstructed nested list of EntityAnnotation objects. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 589–609
extract_image_from_page(page)
Extracts the first image from the given PDF page and returns it as a PIL Image.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `page` | `Page` | The PDF page object. | required |

Returns:

| Type | Description |
|---|---|
| `Image.Image` | The first image on the page as a Pillow Image object. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If no images are found on the page or the image data is incomplete. |
| `Exception` | For unexpected errors during image extraction. |

Example:

```python
import fitz
from PIL import Image

doc = fitz.open("/path/to/document.pdf")
page = doc.load_page(0)  # Load the first page
try:
    image = extract_image_from_page(page)
    image.show()  # Display the image
except Exception as e:
    print(f"Error extracting image: {e}")
```

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 185–237
get_page_dimensions(page)
Extracts the width and height of a single PDF page in both inches and pixels.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `page` | `Page` | A single PDF page object from PyMuPDF. | required |

Returns:

| Type | Description |
|---|---|
| `dict` | A dictionary containing the width and height of the page in inches and pixels. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 151–182
load_pdf_pages(pdf_path)
Opens the PDF document and returns the fitz Document object.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pdf_path` | `Path` | The path to the PDF file. | required |

Returns:

| Type | Description |
|---|---|
| `fitz.Document` | The loaded PDF document. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the specified file does not exist. |
| `ValueError` | If the file is not a valid PDF document. |
| `Exception` | For any unexpected error. |

Example:

```python
from pathlib import Path

pdf_path = Path("/path/to/example.pdf")
try:
    pdf_doc = load_pdf_pages(pdf_path)
    print(f"PDF contains {pdf_doc.page_count} pages.")
except Exception as e:
    print(f"Error loading PDF: {e}")
```

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 113–148
load_processed_PDF_data(base_path)
Loads processed PDF data from files using metadata for file references.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `base_path` | `Path` | Base path of the processed-data directory where the data and its metadata are stored. | required |

Returns:

| Type | Description |
|---|---|
| `Tuple[List[str], List[List[EntityAnnotation]], List[Image.Image], List[Image.Image]]` | Loaded text pages, per-page word locations, annotated images, and unannotated images. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If any required files are missing. |
| `ValueError` | If the metadata file is incomplete or invalid. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 679–758
make_image_preprocess_mask(mask_height)
Creates a preprocessing function that masks a specified height at the bottom of the image.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `mask_height` | `float` | The proportion of the image height to mask at the bottom (0.0 to 1.0). | required |

Returns:

| Type | Description |
|---|---|
| `Callable[[Image.Image, int], Image.Image]` | A preprocessing function that takes an image and page number as input and returns the processed image. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 300–339
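The factory-closure pattern this function uses (capture `mask_height`, return a per-page preprocessor) can be sketched on a plain 2D pixel grid; the real function operates on PIL images, so the grid representation here is an assumption for illustration:

```python
def make_image_preprocess_mask(mask_height):
    # Return a preprocessing function that blanks out the bottom
    # mask_height fraction of an image (sketch on a 2D list of pixels).
    if not 0.0 <= mask_height <= 1.0:
        raise ValueError("mask_height must be between 0.0 and 1.0")

    def preprocess(pixels, page_num):
        height = len(pixels)
        cutoff = int(height * (1 - mask_height))
        # Rows above the cutoff are copied; rows below are painted white (255).
        return [row[:] if y < cutoff else [255] * len(row)
                for y, row in enumerate(pixels)]

    return preprocess

mask_bottom_quarter = make_image_preprocess_mask(0.25)
img = [[0] * 4 for _ in range(4)]
out = mask_bottom_quarter(img, page_num=1)
print(out[0], out[3])
```

Masking a fixed footer region (page numbers, running titles) before OCR keeps that text out of the recognized output.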
pil_to_bytes(image, format='PNG')
Converts a Pillow image to raw bytes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `image` | `Image` | The Pillow image object to convert. | required |
| `format` | `str` | The format to save the image as (e.g., "PNG", "JPEG"). | `'PNG'` |

Returns:

| Type | Description |
|---|---|
| `bytes` | The raw bytes of the image. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 42–55
process_page(page, client, annotation_font_path, preprocessor=None)
Processes a single PDF page, extracting text, word locations, and annotated images.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `page` | `Page` | The PDF page object. | required |
| `client` | `ImageAnnotatorClient` | Google Vision API client for text detection. | required |
| `annotation_font_path` | `str` | Path to the font file for annotations. | required |
| `preprocessor` | `Callable[[Image.Image, int], Image.Image]` | Preprocessing function for the image. | `None` |

Returns:

| Type | Description |
|---|---|
| `Tuple[str, List[vision.EntityAnnotation], Image.Image, Image.Image, dict]` | Full page text, word locations, annotated image (Pillow Image object), original unprocessed image (Pillow Image object), and page dimensions (dict). |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 389–454
process_single_image(image, client, feature_type=DEFAULT_ANNOTATION_METHOD, language_hints=DEFAULT_ANNOTATION_LANGUAGE_HINTS)
Processes a single image with the Google Vision API and returns text annotations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `image` | `Image` | The preprocessed Pillow image object. | required |
| `client` | `ImageAnnotatorClient` | Google Vision API client for text detection. | required |
| `feature_type` | `str` | Type of text detection to use ('TEXT_DETECTION' or 'DOCUMENT_TEXT_DETECTION'). | `DEFAULT_ANNOTATION_METHOD` |
| `language_hints` | `List` | Language hints for OCR. | `DEFAULT_ANNOTATION_LANGUAGE_HINTS` |

Returns:

| Type | Description |
|---|---|
| `List[vision.EntityAnnotation]` | Text annotations from the Vision API response. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If no text is detected. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 342–386
save_processed_pdf_data(output_dir, journal_name, text_pages, word_locations, annotated_images, unannotated_images)
Saves processed PDF data to files for later reloading.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `output_dir` | `Path` | Directory to save the data (as a Path object). | required |
| `journal_name` | `str` | Base name for the output directory (usually the PDF name without extension). | required |
| `text_pages` | `List[str]` | Extracted full-page text. | required |
| `word_locations` | `List[List[EntityAnnotation]]` | Word locations and annotations from Vision API. | required |
| `annotated_images` | `List[Image]` | Annotated images with bounding boxes. | required |
| `unannotated_images` | `List[Image]` | Raw unannotated images. | required |

Returns:

| Type | Description |
|---|---|
| `None` | |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 612–676
serialize_entity_annotations_to_json(annotations)
Serializes a nested list of EntityAnnotation objects into a JSON-compatible format using Base64 encoding.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `annotations` | `List[List[EntityAnnotation]]` | The nested list of EntityAnnotation objects. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The serialized data in JSON format as a string. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 565–586
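The Base64-in-JSON approach used by the serialize/deserialize pair can be shown with a stdlib-only sketch over opaque binary payloads; the real functions work with `vision.EntityAnnotation` protobufs (presumably via their binary serialization), which the `bytes` values stand in for here:

```python
import base64
import json

def serialize_nested(blobs):
    # Encode each binary payload as Base64 text so the nested
    # structure survives a round trip through JSON.
    return json.dumps([[base64.b64encode(b).decode("ascii") for b in page]
                       for page in blobs])

def deserialize_nested(data):
    # Reverse the encoding, rebuilding the nested list of bytes.
    return [[base64.b64decode(s) for s in page] for page in json.loads(data)]

pages = [[b"\x08\x01hello", b"world"], [b"\xff\xfe"]]
round_tripped = deserialize_nested(serialize_nested(pages))
print(round_tripped == pages)
```

Base64 matters because raw protobuf bytes are not valid JSON strings; encoding first makes the cache files plain text.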
start_image_annotator_client(credentials_file=None, api_endpoint='vision.googleapis.com', timeout=(10, 30), enable_logging=False)
Starts and returns a Google Vision API ImageAnnotatorClient with optional configuration.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `credentials_file` | `str` | Path to the credentials JSON file. If None, uses the default environment variable. | `None` |
| `api_endpoint` | `str` | Custom API endpoint for the Vision API. Default is the global endpoint. | `'vision.googleapis.com'` |
| `timeout` | `Tuple[int, int]` | Connection and read timeouts in seconds. | `(10, 30)` |
| `enable_logging` | `bool` | Enable detailed logging for debugging. | `False` |

Returns:

| Type | Description |
|---|---|
| `vision.ImageAnnotatorClient` | Configured Vision API client. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the specified credentials file is not found. |
| `Exception` | For unexpected errors during client setup. |

Example:

```python
client = start_image_annotator_client(
    credentials_file="/path/to/credentials.json",
    api_endpoint="vision.googleapis.com",
    timeout=(10, 30),
    enable_logging=True,
)
print("Google Vision API client initialized.")
```

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py, lines 58–110
ocr_editor
current_image = st.session_state.current_image
module-attribute
current_page_index = st.session_state.current_page_index
module-attribute
current_text = pages[current_page_index]
module-attribute
edited_text = st.text_area('Edit OCR Text', value=(st.session_state.current_text), key=f'text_area_{st.session_state.current_page_index}', height=400)
module-attribute
image_directory = st.sidebar.text_input('Image Directory', value='./images')
module-attribute
ocr_text_directory = st.sidebar.text_input('OCR Text Directory', value='./ocr_text')
module-attribute
pages = st.session_state.pages
module-attribute
save_path = os.path.join(ocr_text_directory, 'updated_ocr.xml')
module-attribute
tree = st.session_state.tree
module-attribute
uploaded_image_file = st.sidebar.file_uploader('Upload an Image', type=['jpg', 'jpeg', 'png', 'pdf'])
module-attribute
uploaded_text_file = st.sidebar.file_uploader('Upload OCR Text File', type=['xml'])
module-attribute
extract_pages(tree)
Extract page data from the XML tree.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tree` | `ElementTree` | Parsed XML tree. | required |

Returns:

| Type | Description |
|---|---|
| `list` | A list of dictionaries containing 'number' and 'text' for each page. |

Source code in src/tnh_scholar/ocr_processing/ocr_editor.py, lines 50–65
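A minimal sketch of such a page extractor, assuming a hypothetical `<document><page number="...">...</page></document>` layout (the editor's actual XML schema may differ):

```python
import xml.etree.ElementTree as ET

def extract_pages(tree):
    # Collect page number and text content from each <page> element.
    return [{"number": page.get("number"), "text": (page.text or "").strip()}
            for page in tree.getroot().iter("page")]

xml = ('<document>'
       '<page number="1">Thay writes...</page>'
       '<page number="2">More text</page>'
       '</document>')
tree = ET.ElementTree(ET.fromstring(xml))
pages = extract_pages(tree)
print(pages[0]["number"], pages[1]["text"])
```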
load_xml(file_obj)
Load an XML file from a file-like object.
Source code in src/tnh_scholar/ocr_processing/ocr_editor.py, lines 28–37
save_xml(tree, file_path)
Save the modified XML tree to a file.
Source code in src/tnh_scholar/ocr_processing/ocr_editor.py, lines 41–46
ocr_processing
DEFAULT_ANNOTATION_FONT_PATH = Path('/System/Library/Fonts/Supplemental/Arial.ttf')
module-attribute
DEFAULT_ANNOTATION_FONT_SIZE = 12
module-attribute
DEFAULT_ANNOTATION_LANGUAGE_HINTS = ['vi']
module-attribute
DEFAULT_ANNOTATION_METHOD = 'DOCUMENT_TEXT_DETECTION'
module-attribute
DEFAULT_ANNOTATION_OFFSET = 2
module-attribute
logger = logging.getLogger('ocr_processing')
module-attribute
PDFParseWarning
Bases: Warning
Custom warning class for PDF parsing issues. Encapsulates minimal logic for displaying warnings with a custom format.
Source code in src/tnh_scholar/ocr_processing/ocr_processing.py
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | |
warn(message)
staticmethod
Display a warning message with custom formatting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
message
|
str
|
The warning message to display. |
required |
Source code in src/tnh_scholar/ocr_processing/ocr_processing.py
30 31 32 33 34 35 36 37 38 39 | |
annotate_image_with_text(image, text_annotations, annotation_font_path, font_size=12)
Annotates a PIL image with bounding boxes and text descriptions from OCR results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
pil_image
|
Image
|
The input PIL image to annotate. |
required |
text_annotations
|
List[EntityAnnotation]
|
OCR results containing bounding boxes and text. |
required |
annotation_font_path
|
str
|
Path to the font file for text annotations. |
required |
font_size
|
int
|
Font size for text annotations. |
12
|
Returns:
| Type | Description |
|---|---|
Image
|
Image.Image: The annotated PIL image. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the input image is None. |
IOError
|
If the font file cannot be loaded. |
Exception
|
For any other unexpected errors. |
Source code in src/tnh_scholar/ocr_processing/ocr_processing.py
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 | |
build_processed_pdf(pdf_path, client, preprocessor=None, annotation_font_path=DEFAULT_ANNOTATION_FONT_PATH)
Processes a PDF document, extracting text, word locations, annotated images, and unannotated images.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
pdf_path
|
Path
|
Path to the PDF file. |
required |
client
|
ImageAnnotatorClient
|
Google Vision API client for text detection. |
required |
annotation_font_path
|
Path
|
Path to the font file for annotations. |
DEFAULT_ANNOTATION_FONT_PATH
|
Returns:
| Type | Description |
|---|---|
Tuple[List[str], List[List[EntityAnnotation]], List[Image], List[Image]]
|
Tuple[List[str], List[List[vision.EntityAnnotation]], List[Image.Image], List[Image.Image]]:
- List of extracted full-page texts (one entry per page).
- List of word locations (list of |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If the specified PDF file does not exist. |
ValueError
|
If the PDF file is invalid or contains no pages. |
Exception
|
For any unexpected errors during processing. |
Example
from pathlib import Path from google.cloud import vision pdf_path = Path("/path/to/example.pdf") font_path = Path("/path/to/fonts/Arial.ttf") client = vision.ImageAnnotatorClient() try: text_pages, word_locations_list, annotated_images, unannotated_images = build_processed_pdf( pdf_path, client, font_path ) print(f"Processed {len(text_pages)} pages successfully!") except Exception as e: print(f"Error processing PDF: {e}")
Source code in src/tnh_scholar/ocr_processing/ocr_processing.py
457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 | |
deserialize_entity_annotations_from_json(data)
Deserializes JSON data into a nested list of EntityAnnotation objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
str
|
The JSON string containing serialized annotations. |
required |
Returns:
| Type | Description |
|---|---|
List[List[EntityAnnotation]]
|
List[List[EntityAnnotation]]: The reconstructed nested list of EntityAnnotation objects. |
Source code in src/tnh_scholar/ocr_processing/ocr_processing.py
589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 | |
extract_image_from_page(page)
Extracts the first image from the given PDF page and returns it as a PIL Image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
page
|
Page
|
The PDF page object. |
required |
Returns:
| Type | Description |
|---|---|
Image
|
Image.Image: The first image on the page as a Pillow Image object. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If no images are found on the page or the image data is incomplete. |
Exception
|
For unexpected errors during image extraction. |
Example
import fitz from PIL import Image doc = fitz.open("/path/to/document.pdf") page = doc.load_page(0) # Load the first page try: image = extract_image_from_page(page) image.show() # Display the image except Exception as e: print(f"Error extracting image: {e}")
Source code in src/tnh_scholar/ocr_processing/ocr_processing.py
185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 | |
get_page_dimensions(page)
Extracts the width and height of a single PDF page in both inches and pixels.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
page
|
Page
|
A single PDF page object from PyMuPDF. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
dict |
dict
|
A dictionary containing the width and height of the page in inches and pixels. |
Source code in src/tnh_scholar/ocr_processing/ocr_processing.py
151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 | |
load_pdf_pages(pdf_path)

Opens the PDF document and returns the fitz Document object.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| pdf_path | Path | The path to the PDF file. | required |

Returns:

| Type | Description |
|---|---|
| fitz.Document | The loaded PDF document. |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If the specified file does not exist. |
| ValueError | If the file is not a valid PDF document. |
| Exception | For any unexpected error. |

Example:

    from pathlib import Path

    pdf_path = Path("/path/to/example.pdf")
    try:
        pdf_doc = load_pdf_pages(pdf_path)
        print(f"PDF contains {pdf_doc.page_count} pages.")
    except Exception as e:
        print(f"Error loading PDF: {e}")

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py (lines 113-148)
load_processed_PDF_data(base_path)

Loads processed PDF data from files using metadata for file references.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | Path | Directory where the data is stored (as a Path object). | required |
| base_name | str | Base name of the processed directory. | required |

Returns:

| Type | Description |
|---|---|
| Tuple[List[str], List[List[EntityAnnotation]], List[Image.Image], List[Image.Image]] | - Loaded text pages. - Word locations (list of |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If any required files are missing. |
| ValueError | If the metadata file is incomplete or invalid. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py (lines 679-758)
make_image_preprocess_mask(mask_height)

Creates a preprocessing function that masks a specified height at the bottom of the image.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| mask_height | float | The proportion of the image height to mask at the bottom (0.0 to 1.0). | required |

Returns:

| Type | Description |
|---|---|
| Callable[[Image.Image, int], Image.Image] | A preprocessing function that takes an image and page number as input and returns the processed image. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py (lines 300-339)
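The factory pattern here, configuring the mask once and returning a per-page callable, can be sketched with a simplified stand-in. The real function operates on Pillow images; this sketch treats an "image" as a list of pixel rows purely to illustrate the closure, and is not the package source.

```python
# Simplified stand-in: an "image" is a list of pixel rows. The real
# preprocessor works on Pillow Image objects; this only illustrates
# the closure pattern of baking mask_height into a per-page callable.
def make_preprocess_mask(mask_height: float):
    def preprocess(rows: list, page_num: int) -> list:
        n_masked = int(len(rows) * mask_height)
        keep = len(rows) - n_masked
        # Replace the bottom rows with white (255) pixels.
        return rows[:keep] + [[255] * len(row) for row in rows[keep:]]
    return preprocess
```

The returned callable matches the `Callable[[Image.Image, int], Image.Image]` shape documented above, so it can be handed to `process_page` as the preprocessing hook.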
pil_to_bytes(image, format='PNG')

Converts a Pillow image to raw bytes.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| image | Image | The Pillow image object to convert. | required |
| format | str | The format to save the image as (e.g., "PNG", "JPEG"). Default is "PNG". | 'PNG' |

Returns:

| Name | Type | Description |
|---|---|---|
| bytes | bytes | The raw bytes of the image. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py (lines 42-55)
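A conversion like this is usually a few lines around `io.BytesIO`. The sketch below shows the likely shape; it is an assumption about the implementation, and works with any object exposing a Pillow-style `save(buffer, format=...)` method.

```python
import io

# Likely shape of the conversion: save into an in-memory buffer and
# return its contents. Duck-typed: any object with a Pillow-style
# save() method works.
def pil_to_bytes(image, format: str = "PNG") -> bytes:
    buffer = io.BytesIO()
    image.save(buffer, format=format)
    return buffer.getvalue()
```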
process_page(page, client, annotation_font_path, preprocessor=None)

Processes a single PDF page, extracting text, word locations, and annotated images.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| page | Page | The PDF page object. | required |
| client | ImageAnnotatorClient | Google Vision API client for text detection. | required |
| annotation_font_path | str | Path to the font file for annotations. | required |
| preprocessor | Callable[[Image.Image, int], Image.Image] | Preprocessing function for the image. | None |

Returns:

| Type | Description |
|---|---|
| Tuple[str, List[vision.EntityAnnotation], Image.Image, Image.Image, dict] | - Full page text (str) - Word locations (List of vision.EntityAnnotation) - Annotated image (Pillow Image object) - Original unprocessed image (Pillow Image object) - Page dimensions (dict) |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py (lines 389-454)
process_single_image(image, client, feature_type=DEFAULT_ANNOTATION_METHOD, language_hints=DEFAULT_ANNOTATION_LANGUAGE_HINTS)

Processes a single image with the Google Vision API and returns text annotations.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| image | Image | The preprocessed Pillow image object. | required |
| client | ImageAnnotatorClient | Google Vision API client for text detection. | required |
| feature_type | str | Type of text detection to use ('TEXT_DETECTION' or 'DOCUMENT_TEXT_DETECTION'). | DEFAULT_ANNOTATION_METHOD |
| language_hints | List | Language hints for OCR. | DEFAULT_ANNOTATION_LANGUAGE_HINTS |

Returns:

| Type | Description |
|---|---|
| List[vision.EntityAnnotation] | Text annotations from the Vision API response. |

Raises:

| Type | Description |
|---|---|
| ValueError | If no text is detected. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py (lines 342-386)
save_processed_pdf_data(output_dir, journal_name, text_pages, word_locations, annotated_images, unannotated_images)

Saves processed PDF data to files for later reloading.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | Path | Directory to save the data (as a Path object). | required |
| journal_name | str | Base name for the output directory (usually the PDF name without extension). | required |
| text_pages | List[str] | Extracted full-page text. | required |
| word_locations | List[List[EntityAnnotation]] | Word locations and annotations from Vision API. | required |
| annotated_images | List[Image] | Annotated images with bounding boxes. | required |
| unannotated_images | List[Image] | Raw unannotated images. | required |

Returns:

| Type | Description |
|---|---|
| None | |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py (lines 612-676)
serialize_entity_annotations_to_json(annotations)

Serializes a nested list of EntityAnnotation objects into a JSON-compatible format using Base64 encoding.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| annotations | List[List[EntityAnnotation]] | The nested list of EntityAnnotation objects. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The serialized data in JSON format as a string. |

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py (lines 565-586)
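EntityAnnotation is a protobuf message, so a Base64-over-JSON scheme plausibly round-trips each message's binary form. The sketch below illustrates that scheme under stated assumptions: it is not the package source, and plain `bytes` stand in for the output of a protobuf `SerializeToString()` call.

```python
import base64
import json

# Plausible scheme: Base64-encode each annotation's binary (protobuf)
# form so the nested page/word lists survive a round trip through JSON.
# Plain bytes stand in for EntityAnnotation.SerializeToString() output.
def serialize_to_json(annotation_pages: list) -> str:
    encoded = [
        [base64.b64encode(blob).decode("ascii") for blob in page]
        for page in annotation_pages
    ]
    return json.dumps(encoded)

def deserialize_from_json(data: str) -> list:
    return [
        [base64.b64decode(item) for item in page]
        for page in json.loads(data)
    ]
```

Base64 is needed because raw protobuf bytes are not valid JSON string content; the decode step reverses it losslessly.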
start_image_annotator_client(credentials_file=None, api_endpoint='vision.googleapis.com', timeout=(10, 30), enable_logging=False)

Starts and returns a Google Vision API ImageAnnotatorClient with optional configuration.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| credentials_file | str | Path to the credentials JSON file. If None, uses the default environment variable. | None |
| api_endpoint | str | Custom API endpoint for the Vision API. Default is the global endpoint. | 'vision.googleapis.com' |
| timeout | Tuple[int, int] | Connection and read timeouts in seconds. Default is (10, 30). | (10, 30) |
| enable_logging | bool | Enable detailed logging for debugging. Default is False. | False |

Returns:

| Type | Description |
|---|---|
| vision.ImageAnnotatorClient | Configured Vision API client. |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If the specified credentials file is not found. |
| Exception | For unexpected errors during client setup. |

Example:

    client = start_image_annotator_client(
        credentials_file="/path/to/credentials.json",
        api_endpoint="vision.googleapis.com",
        timeout=(10, 30),
        enable_logging=True,
    )
    print("Google Vision API client initialized.")

Source code in src/tnh_scholar/ocr_processing/ocr_processing.py (lines 58-110)
text_processing
__all__ = ['bracket_lines', 'unbracket_lines', 'lines_from_bracketed_text', 'NumberedText', 'normalize_newlines', 'clean_text']
module-attribute
NumberedText
Represents a text document with numbered lines for easy reference and manipulation.
Provides utilities for working with line-numbered text including reading, writing, accessing lines by number, and iterating over numbered lines.
Attributes:
| Name | Type | Description |
|---|---|---|
| lines | List[str] | List of text lines |
| start | int | Starting line number (default: 1) |
| separator | str | Separator between line number and content (default: ": ") |
Examples:
>>> text = "First line\nSecond line\n\nFourth line"
>>> doc = NumberedText(text)
>>> print(doc)
1: First line
2: Second line
3:
4: Fourth line
>>> print(doc.get_line(2))
Second line
>>> for num, line in doc:
... print(f"Line {num}: {len(line)} chars")
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 14-385)
content
property
Get original text without line numbers.
end
property
lines = []
instance-attribute
numbered_content
property
Get text with line numbers as a string. Equivalent to str(self)
numbered_lines
property
Get list of lines with line numbers included.
Returns:
| Type | Description |
|---|---|
| List[str] | Lines with numbers and separator prefixed |
Examples:
>>> doc = NumberedText("First line\nSecond line")
>>> doc.numbered_lines
['1: First line', '2: Second line']
Note
- Unlike str(self), this returns a list rather than joined string
- Maintains consistent formatting with separator
- Useful for processing or displaying individual numbered lines
separator = separator
instance-attribute
size
property
Get the number of lines.
start = start
instance-attribute
LineSegment
dataclass
Represents a segment of lines with start and end indices in 1-based indexing.
The segment follows Python range conventions where start is inclusive and end is exclusive. However, indexing is 1-based to match NumberedText.
Attributes:
| Name | Type | Description |
|---|---|---|
| start | int | Starting line number (inclusive, 1-based) |
| end | int | Ending line number (exclusive, 1-based) |
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 42-61)
end
instance-attribute
start
instance-attribute
__init__(start, end)
__iter__()
Allow unpacking into start, end pairs.
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 58-61)
SegmentIterator
Iterator for generating line segments of specified size.
Produces segments of lines with start/end indices following 1-based indexing. The final segment may be smaller than the specified segment size.
Attributes:
| Name | Type | Description |
|---|---|---|
| total_lines | | Total number of lines in text |
| segment_size | | Number of lines per segment |
| start_line | | Starting line number (1-based) |
| min_segment_size | | Minimum size for the final segment |
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 63-142)
min_segment_size = min_segment_size
instance-attribute
num_segments = (remaining_lines + segment_size - 1) // segment_size
instance-attribute
segment_size = segment_size
instance-attribute
start_line = start_line
instance-attribute
total_lines = total_lines
instance-attribute
__init__(total_lines, segment_size, start_line=1, min_segment_size=None)
Initialize the segment iterator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| total_lines | int | Total number of lines to iterate over | required |
| segment_size | int | Desired size of each segment | required |
| start_line | int | First line number (default: 1) | 1 |
| min_segment_size | Optional[int] | Minimum size for final segment (default: None). If specified, the last segment will be merged with the previous one if it would be smaller than this size. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If segment_size < 1 or total_lines < 1 |
| ValueError | If start_line < 1 (must use 1-based indexing) |
| ValueError | If min_segment_size >= segment_size |
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 77-116)
__iter__()
Iterate over line segments.
Yields:

| Type | Description |
|---|---|
| LineSegment | LineSegment containing start (inclusive) and end (exclusive) indices |

Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 118-142)
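The segmentation arithmetic SegmentIterator implements can be sketched as a plain function. This is a simplified reimplementation for illustration, not the class itself; note how a too-short final segment is merged into the previous one when min_segment_size is set.

```python
# Simplified sketch of 1-based segmentation with an optional minimum
# size for the final segment (merged into the previous one when short).
def iter_segment_bounds(total_lines, segment_size, start_line=1, min_segment_size=None):
    end = start_line + total_lines  # exclusive upper bound
    bounds = []
    pos = start_line
    while pos < end:
        bounds.append((pos, min(pos + segment_size, end)))
        pos += segment_size
    if min_segment_size and len(bounds) > 1:
        last_start, last_end = bounds[-1]
        if last_end - last_start < min_segment_size:
            bounds.pop()
            # Merge: extend the previous segment to cover the short tail.
            bounds[-1] = (bounds[-1][0], last_end)
    return bounds
```

For five lines and segment_size=2 this yields (1, 3), (3, 5), (5, 6), matching the iter_segments example below; with min_segment_size=2 the trailing one-line segment folds into its predecessor.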
__getitem__(index)
Get line content by line number (1-based indexing).
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 251-253)
__init__(content=None, start=1, separator=':')
Initialize a numbered text document, detecting and preserving existing numbering.
Valid numbered text must have:

- Sequential line numbers
- Consistent separator character(s)
- Every non-empty line following the numbering pattern
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| content | Optional[str] | Initial text content, if any | None |
| start | int | Starting line number (used only if content isn't already numbered) | 1 |
| separator | str | Separator between line numbers and content | ':' |
Examples:
>>> # Custom separators
>>> doc = NumberedText("1→First line\n2→Second line")
>>> doc.separator == "→"
True
>>> # Preserves starting number
>>> doc = NumberedText("5#First\n6#Second")
>>> doc.start == 5
True
>>> # Regular numbered list isn't treated as line numbers
>>> doc = NumberedText("1. First item\n2. Second item")
>>> doc.numbered_lines
['1: 1. First item', '2: 2. Second item']
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 144-216)
__iter__()
Iterate over (line_number, line_content) pairs.
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 247-249)
__len__()
Return the number of lines.
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 243-245)
__str__()
Return the numbered text representation.
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 237-241)
append(text)
Append text, splitting into lines if needed.
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 328-330)
from_file(path, **kwargs)
classmethod
Create a NumberedText instance from a file.
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 218-221)
get_line(line_num)
Get content of specified line number.
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 255-257)
get_lines(start, end)
Get content of line range, not inclusive of end line.
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 267-269)
get_numbered_line(line_num)
Get specified line with line number.
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 262-265)
get_numbered_lines(start, end)
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 271-275)
get_numbered_segment(start, end)
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 314-315)
get_segment(start, end)
Return the segment from the start line (inclusive) up to the end line (exclusive).
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 276-284)
insert(line_num, text)
Insert text at specified line number. Assumes text is not empty.
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 332-336)
iter_segments(segment_size, min_segment_size=None)
Iterate over segments of the text with specified size.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| segment_size | int | Number of lines per segment | required |
| min_segment_size | Optional[int] | Optional minimum size for final segment. If specified, last segment will be merged with previous one if it would be smaller than this size. | None |

Yields:

| Type | Description |
|---|---|
| LineSegment | LineSegment objects containing start and end line numbers |

Example:

    >>> text = NumberedText("line1\nline2\nline3\nline4\nline5")
    >>> for segment in text.iter_segments(2):
    ...     print(f"Lines {segment.start}-{segment.end}")
    Lines 1-3
    Lines 3-5
    Lines 5-6

Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 286-312)
remove_whitespace()
Remove leading and trailing whitespace from all lines.
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 341-343)
reset_numbering()
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 338-339)
save(path, numbered=True)
Save document to file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Output file path | required |
| numbered | bool | Whether to save with line numbers (default: True) | True |

Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 317-326)
bracket_lines(text, number=False)
Encloses each line of the input text with angle brackets.
If number is True, adds a line number followed by a colon `:` and then the line.
Args:
    text (str): The input string containing lines separated by '\n'.
    number (bool): Whether to prepend line numbers to each line.

Returns:
    str: A string where each line is enclosed in angle brackets.

Examples:
    >>> bracket_lines("This is a string with\n two lines.")
    '<This is a string with>\n< two lines.>'
    >>> bracket_lines("This is a string with\n two lines.", number=True)
    '<1:This is a string with>\n<2: two lines.>'

Source code in src/tnh_scholar/text_processing/bracket.py (lines 16-38)
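The documented behavior can be captured in a short reimplementation sketch (an illustration, not the package's actual source):

```python
# Illustrative reimplementation: wrap each line in angle brackets,
# optionally prefixing "N:" line numbers starting at 1.
def bracket_lines(text: str, number: bool = False) -> str:
    lines = text.split("\n")
    if number:
        return "\n".join(f"<{i}:{line}>" for i, line in enumerate(lines, start=1))
    return "\n".join(f"<{line}>" for line in lines)
```

Bracketing makes line boundaries explicit so that downstream tools (or an LLM) can reference and return exact lines without ambiguity.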
clean_text(text, newline=False)

Cleans a given text by replacing specific unwanted characters, such as tabs and non-breaking spaces, with regular spaces.

This function takes a string as input and applies replacements based on a predefined mapping of characters to replace.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to be cleaned. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | | The cleaned text with unwanted characters replaced by spaces. |

Example:

    >>> text = "This is\n an example\ttext with\xa0extra spaces."
    >>> clean_text(text)
    'This is an example text with extra spaces.'

Source code in src/tnh_scholar/text_processing/text_processing.py (lines 30-64)
lines_from_bracketed_text(text, start, end, keep_brackets=False)
Extracts lines from bracketed text between the start and end indices, inclusive.
Handles both numbered and non-numbered cases.
Args:
text (str): The input bracketed text containing lines like <...>.
start (int): The starting line number (1-based).
end (int): The ending line number (1-based).
Returns:
list[str]: The lines from start to end inclusive, with angle brackets removed.
Raises:
FormattingError: If the text contains improperly formatted lines (missing angle brackets).
ValueError: If start or end indices are invalid or out of bounds.
Examples:
    >>> text = "<1:Line 1>\n<2:Line 2>\n<3:Line 3>"
    >>> lines_from_bracketed_text(text, 1, 2)
    ['Line 1', 'Line 2']
    >>> text = "<Line 1>

Source code in src/tnh_scholar/text_processing/bracket.py (lines 131-182)
normalize_newlines(text, spacing=2)
Normalize newline blocks in the input text by reducing consecutive newlines
to the specified number of newlines for consistent readability and formatting.
Parameters:
----------
text : str
    The input text containing inconsistent newline spacing.
spacing : int, optional
    The number of newlines to insert between lines. Defaults to 2.

Returns:
-------
str
    The text with consecutive newlines reduced to the specified number of newlines.

Example:
--------
    >>> raw_text = "Heading\n\n\nParagraph text 1\nParagraph text 2\n\n"
    >>> normalize_newlines(raw_text, spacing=2)
    'Heading\n\nParagraph text 1\n\nParagraph text 2\n\n'

Source code in src/tnh_scholar/text_processing/text_processing.py (lines 3-28)
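One plausible implementation is a single regular-expression substitution that rewrites every run of newlines to the requested spacing. This is a sketch of the documented behavior, not the actual source:

```python
import re

# Rewrite every run of one or more newlines to exactly `spacing` newlines,
# so both cramped and over-spaced text normalize to the same layout.
def normalize_newlines(text: str, spacing: int = 2) -> str:
    return re.sub(r"\n+", "\n" * spacing, text)
```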
unbracket_lines(text, number=False)
Removes angle brackets (< >) from encapsulated lines and optionally removes line numbers.
Args:
text (str): The input string with encapsulated lines.
number (bool): If True, removes line numbers in the format 'digit:'.
Raises a ValueError if `number=True` and a line does not start with a digit followed by a colon.
Returns:
str: A newline-separated string with the encapsulation removed, and line numbers stripped if specified.
Examples:
    >>> unbracket_lines("<1:Line 1>\n<2:Line 2>", number=True)
    'Line 1\nLine 2'
    >>> unbracket_lines("<Line 1>
    >>> unbracket_lines("<1Line 1>", number=True)
    ValueError: Line does not start with a valid number: '1Line 1'

Source code in src/tnh_scholar/text_processing/bracket.py (lines 82-118)
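A sketch of the documented behavior (illustrative, not the package source; the exact exception types the package raises for malformed brackets may differ):

```python
import re

# Strip the surrounding angle brackets from each line; with number=True,
# also strip a leading "N:" prefix, raising if it is missing.
def unbracket_lines(text: str, number: bool = False) -> str:
    out = []
    for line in text.splitlines():
        if not (line.startswith("<") and line.endswith(">")):
            raise ValueError(f"Improperly formatted line: {line!r}")
        inner = line[1:-1]
        if number:
            match = re.match(r"(\d+):(.*)", inner)
            if match is None:
                raise ValueError(f"Line does not start with a valid number: '{inner}'")
            inner = match.group(2)
        out.append(inner)
    return "\n".join(out)
```

This is the inverse of bracket_lines: a bracketed, numbered document round-trips back to its original text.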
bracket
FormattingError
Bases: Exception
Custom exception raised for formatting-related errors.
Source code in src/tnh_scholar/text_processing/bracket.py (lines 5-11)
__init__(message='An error occurred due to invalid formatting.')
Source code in src/tnh_scholar/text_processing/bracket.py (lines 10-11)
bracket_all_lines(pages)
Source code in src/tnh_scholar/text_processing/bracket.py (lines 78-79)
bracket_lines(text, number=False)
Encloses each line of the input text with angle brackets.
If number is True, adds a line number followed by a colon `:` and then the line.
Args:
    text (str): The input string containing lines separated by '\n'.
    number (bool): Whether to prepend line numbers to each line.

Returns:
    str: A string where each line is enclosed in angle brackets.

Examples:
    >>> bracket_lines("This is a string with\n two lines.")
    '<This is a string with>\n< two lines.>'
    >>> bracket_lines("This is a string with\n two lines.", number=True)
    '<1:This is a string with>\n<2: two lines.>'

Source code in src/tnh_scholar/text_processing/bracket.py (lines 16-38)
lines_from_bracketed_text(text, start, end, keep_brackets=False)
Extracts lines from bracketed text between the start and end indices, inclusive.
Handles both numbered and non-numbered cases.
Args:
text (str): The input bracketed text containing lines like <...>.
start (int): The starting line number (1-based).
end (int): The ending line number (1-based).
Returns:
list[str]: The lines from start to end inclusive, with angle brackets removed.
Raises:
FormattingError: If the text contains improperly formatted lines (missing angle brackets).
ValueError: If start or end indices are invalid or out of bounds.
Examples:
    >>> text = "<1:Line 1>\n<2:Line 2>\n<3:Line 3>"
    >>> lines_from_bracketed_text(text, 1, 2)
    ['Line 1', 'Line 2']
    >>> text = "<Line 1>

Source code in src/tnh_scholar/text_processing/bracket.py (lines 131-182)
number_lines(text, start=1, separator=': ')
Numbers each line of text with a readable format, including empty lines.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text to be numbered. Can be multi-line. | required |
| start | int | Starting line number. Defaults to 1. | 1 |
| separator | str | Separator between line number and content. Defaults to ": ". | ': ' |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Numbered text where each line starts with "{number}: ". |
Examples:
>>> text = "First line\nSecond line\n\nFourth line"
>>> print(number_lines(text))
1: First line
2: Second line
3:
4: Fourth line
>>> print(number_lines(text, start=5, separator=" | "))
5 | First line
6 | Second line
7 |
8 | Fourth line
Notes
- All lines are numbered, including empty lines, to maintain text structure
- Line numbers are aligned through natural string formatting
- Customizable separator allows for different formatting needs
- Can start from any line number for flexibility in text processing
Source code in src/tnh_scholar/text_processing/bracket.py (lines 41-75)
unbracket_all_lines(pages)
Source code in src/tnh_scholar/text_processing/bracket.py (lines 121-128)
unbracket_lines(text, number=False)
Removes angle brackets (< >) from encapsulated lines and optionally removes line numbers.
Args:
text (str): The input string with encapsulated lines.
number (bool): If True, removes line numbers in the format 'digit:'.
Raises a ValueError if `number=True` and a line does not start with a digit followed by a colon.
Returns:
str: A newline-separated string with the encapsulation removed, and line numbers stripped if specified.
Examples:
    >>> unbracket_lines("<1:Line 1>\n<2:Line 2>", number=True)
    'Line 1\nLine 2'
    >>> unbracket_lines("<Line 1>
    >>> unbracket_lines("<1Line 1>", number=True)
    ValueError: Line does not start with a valid number: '1Line 1'

Source code in src/tnh_scholar/text_processing/bracket.py (lines 82-118)
numbered_text
NumberedFormat
Bases: NamedTuple
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 9-12)
is_numbered
instance-attribute
separator = None
class-attribute
instance-attribute
start_num = None
class-attribute
instance-attribute
NumberedText
Represents a text document with numbered lines for easy reference and manipulation.
Provides utilities for working with line-numbered text including reading, writing, accessing lines by number, and iterating over numbered lines.
Attributes:
| Name | Type | Description |
|---|---|---|
| lines | List[str] | List of text lines |
| start | int | Starting line number (default: 1) |
| separator | str | Separator between line number and content (default: ": ") |
Examples:
>>> text = "First line\nSecond line\n\nFourth line"
>>> doc = NumberedText(text)
>>> print(doc)
1: First line
2: Second line
3:
4: Fourth line
>>> print(doc.get_line(2))
Second line
>>> for num, line in doc:
... print(f"Line {num}: {len(line)} chars")
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 14-385)
content
property
Get original text without line numbers.
end
property
lines = []
instance-attribute
numbered_content
property
Get text with line numbers as a string. Equivalent to str(self)
numbered_lines
property
Get list of lines with line numbers included.
Returns:
| Type | Description |
|---|---|
| List[str] | Lines with numbers and separator prefixed |
Examples:
>>> doc = NumberedText("First line\nSecond line")
>>> doc.numbered_lines
['1: First line', '2: Second line']
Note
- Unlike str(self), this returns a list rather than joined string
- Maintains consistent formatting with separator
- Useful for processing or displaying individual numbered lines
separator = separator
instance-attribute
size
property
Get the number of lines.
start = start
instance-attribute
LineSegment
dataclass
Represents a segment of lines with start and end indices in 1-based indexing.
The segment follows Python range conventions where start is inclusive and end is exclusive. However, indexing is 1-based to match NumberedText.
Attributes:
| Name | Type | Description |
|---|---|---|
| start | int | Starting line number (inclusive, 1-based) |
| end | int | Ending line number (exclusive, 1-based) |
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 42-61)
end
instance-attribute
start
instance-attribute
__init__(start, end)
__iter__()
Allow unpacking into start, end pairs.
Source code in src/tnh_scholar/text_processing/numbered_text.py (lines 58-61)
SegmentIterator
Iterator for generating line segments of specified size.
Produces segments of lines with start/end indices following 1-based indexing. The final segment may be smaller than the specified segment size.
Attributes:
| Name | Type | Description |
|---|---|---|
| total_lines | | Total number of lines in text |
| segment_size | | Number of lines per segment |
| start_line | | Starting line number (1-based) |
| min_segment_size | | Minimum size for the final segment |
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 63-142)
min_segment_size = min_segment_size
instance-attribute
num_segments = (remaining_lines + segment_size - 1) // segment_size
instance-attribute
segment_size = segment_size
instance-attribute
start_line = start_line
instance-attribute
total_lines = total_lines
instance-attribute
__init__(total_lines, segment_size, start_line=1, min_segment_size=None)
Initialize the segment iterator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| total_lines | int | Total number of lines to iterate over | required |
| segment_size | int | Desired size of each segment | required |
| start_line | int | First line number | 1 |
| min_segment_size | Optional[int] | Minimum size for final segment. If specified, the last segment will be merged with the previous one if it would be smaller than this size. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If segment_size < 1 or total_lines < 1 |
| ValueError | If start_line < 1 (must use 1-based indexing) |
| ValueError | If min_segment_size >= segment_size |
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 77-116)
__iter__()
Iterate over line segments.
Yields:
| Type | Description |
|---|---|
| LineSegment | LineSegment containing start (inclusive) and end (exclusive) indices |
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 118-142)
__getitem__(index)
Get line content by line number (1-based indexing).
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 251-253)
__init__(content=None, start=1, separator=':')
Initialize a numbered text document, detecting and preserving existing numbering.
Valid numbered text must have:
- Sequential line numbers
- Consistent separator character(s)
- Every non-empty line following the numbering pattern
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Optional[str] | Initial text content, if any | None |
| start | int | Starting line number (used only if content isn't already numbered) | 1 |
| separator | str | Separator between line numbers and content | ':' |
Examples:
>>> # Custom separators
>>> doc = NumberedText("1→First line\n2→Second line")
>>> doc.separator == "→"
True
>>> # Preserves starting number
>>> doc = NumberedText("5#First\n6#Second")
>>> doc.start == 5
True
>>> # Regular numbered list isn't treated as line numbers
>>> doc = NumberedText("1. First item\n2. Second item")
>>> doc.numbered_lines
['1: 1. First item', '2: 2. Second item']
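The detection rules in the examples above can be sketched as a standalone helper. This is a hypothetical reimplementation of the documented behavior (sequential numbers, one consistent separator, Markdown-style `1.` lists rejected), not the library's actual source; the name and regex are assumptions:

```python
import re

# One leading number, a short non-digit separator, then the content.
_NUMBERED_LINE = re.compile(r"^(\d+)(\D{1,3}?)(.*)$")

def detect_numbered_format(text):
    """Return (is_numbered, separator, start) per the documented rules."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return (False, None, None)
    parsed = []
    for ln in lines:
        m = _NUMBERED_LINE.match(ln)
        if not m:
            return (False, None, None)
        parsed.append((int(m.group(1)), m.group(2)))
    sep = parsed[0][1]
    # Reject Markdown-style lists like "1. First item" and mixed separators.
    if sep.strip() == "." or any(s != sep for _, s in parsed):
        return (False, None, None)
    start = parsed[0][0]
    # Numbers must be sequential from the starting value.
    if any(n != start + i for i, (n, _) in enumerate(parsed)):
        return (False, None, None)
    return (True, sep, start)
```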
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 144-216)
__iter__()
Iterate over (line_number, line_content) pairs.
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 247-249)
__len__()
Return the number of lines.
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 243-245)
__str__()
Return the numbered text representation.
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 237-241)
append(text)
Append text, splitting into lines if needed.
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 328-330)
from_file(path, **kwargs)
classmethod
Create a NumberedText instance from a file.
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 218-221)
get_line(line_num)
Get content of specified line number.
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 255-257)
get_lines(start, end)
Get content of line range, not inclusive of end line.
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 267-269)
get_numbered_line(line_num)
Get specified line with line number.
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 262-265)
get_numbered_lines(start, end)
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 271-275)
get_numbered_segment(start, end)
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 314-315)
get_segment(start, end)
Return the segment from the start line (inclusive) up to the end line (exclusive).
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 276-284)
insert(line_num, text)
Insert text at specified line number. Assumes text is not empty.
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 332-336)
iter_segments(segment_size, min_segment_size=None)
Iterate over segments of the text with specified size.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| segment_size | int | Number of lines per segment | required |
| min_segment_size | Optional[int] | Optional minimum size for final segment. If specified, last segment will be merged with previous one if it would be smaller than this size. | None |
Yields:
| Type | Description |
|---|---|
| LineSegment | LineSegment objects containing start and end line numbers |
Example
>>> text = NumberedText("line1\nline2\nline3\nline4\nline5")
>>> for segment in text.iter_segments(2):
...     print(f"Lines {segment.start}-{segment.end}")
Lines 1-3
Lines 3-5
Lines 5-6
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 286-312)
remove_whitespace()
Remove leading and trailing whitespace from all lines.
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 341-343)
reset_numbering()
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 338-339)
save(path, numbered=True)
Save document to file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Output file path | required |
| numbered | bool | Whether to save with line numbers | True |
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 317-326)
get_numbered_format(text)
Analyze text to determine if it follows a consistent line numbering format.
Valid formats have:
- Sequential numbers starting from some value
- Consistent separator character(s)
- Every line following the format
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text to analyze | required |
Returns:
| Type | Description |
|---|---|
| NumberedFormat | Tuple of (is_numbered, separator, start_number) |
Examples:
>>> get_numbered_format("1→First\n2→Second")
(True, "→", 1)
>>> get_numbered_format("1. First")  # Numbered list format
(False, None, None)
>>> get_numbered_format("5#Line\n6#Other")
(True, "#", 5)
Source code in src/tnh_scholar/text_processing/numbered_text.py
(lines 388-427)
text_processing
clean_text(text, newline=False)
Cleans a given text by replacing specific unwanted characters, such as tabs and non-breaking spaces, with regular spaces.
This function takes a string as input and applies replacements based on a predefined mapping of characters to replace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to be cleaned. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | | The cleaned text with unwanted characters replaced by spaces. |
Example
>>> text = "This is\n an example\ttext with\xa0extra spaces."
>>> clean_text(text)
'This is an example text with extra spaces.'
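The cleaning behavior in the example above can be sketched in a few lines. This is an assumed reimplementation of the documented contract (the actual replacement map lives in the source, which is not shown here):

```python
import re

def clean_text(text: str, newline: bool = False) -> str:
    """Sketch: swap tabs and non-breaking spaces for regular spaces,
    treat newlines as spaces unless newline=True, and collapse runs
    of spaces left behind by the replacements."""
    for ch in ("\t", "\xa0"):
        text = text.replace(ch, " ")
    if not newline:
        text = text.replace("\n", " ")
    return re.sub(r" {2,}", " ", text)
```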
Source code in src/tnh_scholar/text_processing/text_processing.py
(lines 30-64)
normalize_newlines(text, spacing=2)
Normalize newline blocks in the input text by reducing consecutive newlines
to the specified number of newlines for consistent readability and formatting.
Parameters:
----------
text : str
The input text containing inconsistent newline spacing.
spacing : int, optional
The number of newlines to insert between lines. Defaults to 2.
Returns:
-------
str
The text with consecutive newlines reduced to the specified number of newlines.
Example:
--------
>>> raw_text = "Heading\n\n\n\nParagraph text 1\n\n\nParagraph text 2\n\n"
>>> normalize_newlines(raw_text, spacing=2)
'Heading\n\nParagraph text 1\n\nParagraph text 2\n\n'
Source code in src/tnh_scholar/text_processing/text_processing.py
(lines 3-28)
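The documented behavior (collapse runs of consecutive newlines to a fixed count) is essentially one substitution; a minimal sketch, not the library's actual implementation:

```python
import re

def normalize_newlines(text: str, spacing: int = 2) -> str:
    """Sketch: reduce every run of 2+ consecutive newlines to exactly
    `spacing` newlines; single newlines are left untouched."""
    return re.sub(r"\n{2,}", "\n" * spacing, text)
```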
tools
Internal helper utilities for dev workflows.
notebook_prep
Utilities for maintaining paired *_local.ipynb notebooks.
EXCLUDED_PARTS = {'.ipynb_checkpoints'}
module-attribute
prep_notebooks(directory, dry_run=True)
Create *_local notebooks and strip outputs from originals.
Parameters
directory:
Directory whose notebooks will be processed.
dry_run:
When True only report pending work without copying files or invoking
nbconvert.
Source code in src/tnh_scholar/tools/notebook_prep.py
(lines 22-73)
tree_builder
Helpers for generating directory-tree text files.
build_tree(root_dir, src_dir=None)
Generate directory trees for the project and optionally its source directory.
Source code in src/tnh_scholar/tools/tree_builder.py
(lines 8-35)
utils
__all__ = ['copy_files_with_regex', 'ensure_directory_exists', 'ensure_directory_writable', 'iterate_subdir', 'path_as_str', 'read_str_from_file', 'sanitize_filename', 'to_slug', 'write_str_to_file', 'load_json_into_model', 'load_jsonl_to_dict', 'save_model_to_json', 'get_language_code_from_text', 'get_language_from_code', 'get_language_name_from_text', 'ExpectedTimeTQDM', 'TimeProgress', 'TimeMs', 'TNHAudioSegment', 'convert_ms_to_sec', 'convert_sec_to_ms', 'get_user_confirmation', 'check_ocr_env', 'check_openai_env']
module-attribute
ExpectedTimeTQDM
A context manager for a time-based tqdm progress bar with optional delay.
- 'expected_time': number of seconds we anticipate the task might take.
- 'display_interval': how often (seconds) to refresh the bar.
- 'desc': a short description for the bar.
- 'delay_start': how many seconds to wait (sleep) before we even create/start the bar.
If the task finishes before 'delay_start' has elapsed, the bar may never appear.
Source code in src/tnh_scholar/utils/progress_utils.py
(lines 13-83)
delay_start = delay_start
instance-attribute
desc = desc
instance-attribute
display_interval = display_interval
instance-attribute
expected_time = round(expected_time)
instance-attribute
__enter__()
Source code in src/tnh_scholar/utils/progress_utils.py
(lines 41-49)
__exit__(exc_type, exc_value, traceback)
Source code in src/tnh_scholar/utils/progress_utils.py
(lines 70-81)
__init__(expected_time, display_interval=0.5, desc='Time-based Progress', delay_start=1.0)
Source code in src/tnh_scholar/utils/progress_utils.py
(lines 25-39)
TNHAudioSegment
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
(lines 25-81)
raw
property
Access the underlying pydub.AudioSegment if needed.
__add__(other)
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
(lines 66-67)
__getitem__(key)
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
(lines 63-64)
__iadd__(other)
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
(lines 69-71)
__init__(segment)
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
(lines 26-27)
__len__()
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
(lines 73-74)
empty()
staticmethod
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
(lines 59-61)
export(out_f, format, **kwargs)
Wrapper: Export the audio segment to a file-like object or file path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| out_f | str \| BinaryIO | File path or file-like object to write the audio data to. | required |
| format | str | Audio format (e.g., 'mp3', 'wav'). | required |
| **kwargs | | Additional keyword arguments passed to pydub.AudioSegment.export. | {} |
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
(lines 44-53)
from_file(file, format=None, **kwargs)
staticmethod
Wrapper: Load an audio file into a TNHAudioSegment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file | str \| Path \| BytesIO | Path to the audio file. | required |
| format | str \| None | Optional audio format (e.g., 'mp3', 'wav'). If None, pydub will attempt to infer it. | None |
| **kwargs | | Additional keyword arguments passed to pydub.AudioSegment.from_file. | {} |
Returns:
| Type | Description |
|---|---|
| TNHAudioSegment | TNHAudioSegment instance containing the loaded audio. |
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
(lines 29-42)
silent(duration)
staticmethod
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
(lines 55-57)
TimeMs
Bases: int
Lightweight representation of a time interval or timestamp in milliseconds. Allows negative values.
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 14-75)
__add__(other)
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 62-63)
__get_pydantic_core_schema__(source_type, handler)
classmethod
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 41-46)
__new__(ms)
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 20-29)
__radd__(other)
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 65-66)
__repr__()
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 74-75)
__rsub__(other)
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 71-72)
__sub__(other)
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 68-69)
from_seconds(seconds)
classmethod
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 31-33)
to_ms()
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 35-36)
to_seconds()
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 38-39)
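The TimeMs interface documented above (an `int` subclass in milliseconds whose arithmetic stays in TimeMs) can be sketched as follows; this is an illustrative reimplementation of the listed methods, not the library source:

```python
class TimeMs(int):
    """Sketch: a time interval/timestamp in milliseconds. Negative
    values are allowed; + and - return TimeMs instances."""

    def __new__(cls, ms):
        return super().__new__(cls, int(ms))

    @classmethod
    def from_seconds(cls, seconds):
        return cls(round(seconds * 1000))

    def to_seconds(self):
        return self / 1000.0

    def __add__(self, other):
        return TimeMs(int(self) + int(other))

    def __sub__(self, other):
        return TimeMs(int(self) - int(other))

    def __repr__(self):
        return f"TimeMs({int(self)})"
```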
TimeProgress
A context manager for a time-based progress display using dots.
The display updates once per second, printing a dot and showing:
- Expected time (if provided)
- Elapsed time (always displayed)
Example:
>>> import time
>>> with TimeProgress(expected_time=60, desc="Transcribing..."):
...     time.sleep(5)  # Simulate work
[Expected Time: 1:00, Elapsed Time: 0:05] .....
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| expected_time | Optional[float] | Expected time in seconds. Optional. | None |
| display_interval | float | How often to print a dot (seconds). | 1.0 |
| desc | str | Description to display alongside the progress. | '' |
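A time-based progress context manager like this can be sketched with a background thread that prints dots until the body finishes. This is an assumed reimplementation of the documented interface (the real class's exact output format may differ):

```python
import sys
import threading
import time

class TimeProgress:
    """Sketch: print a dot every display_interval seconds while the
    managed block runs; report elapsed time on exit."""

    def __init__(self, expected_time=None, display_interval=1.0, desc=""):
        self.expected_time = expected_time
        self.display_interval = display_interval
        self.desc = desc
        self._stop = threading.Event()

    def _run(self):
        # wait() doubles as a sleep that can be interrupted on exit.
        while not self._stop.wait(self.display_interval):
            sys.stdout.write(".")
            sys.stdout.flush()

    def __enter__(self):
        self._start = time.monotonic()
        header = f"[{self.desc}" if self.desc else "["
        if self.expected_time:
            header += f" expected {self.expected_time:.0f}s"
        sys.stdout.write(header + "] ")
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._stop.set()
        self._thread.join()
        elapsed = time.monotonic() - self._start
        sys.stdout.write(f" elapsed {elapsed:.1f}s\n")
        return False
```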
Source code in src/tnh_scholar/utils/progress_utils.py
(lines 88-203)
desc = desc
instance-attribute
display_interval = display_interval
instance-attribute
expected_time = expected_time
instance-attribute
__enter__()
Source code in src/tnh_scholar/utils/progress_utils.py
(lines 122-130)
__exit__(exc_type, exc_value, traceback)
Source code in src/tnh_scholar/utils/progress_utils.py
(lines 172-194)
__init__(expected_time=None, display_interval=1.0, desc='')
Source code in src/tnh_scholar/utils/progress_utils.py
(lines 108-120)
check_ocr_env(output=True)
Check OCR processing requirements.
Source code in src/tnh_scholar/utils/validate.py
(lines 57-59)
check_openai_env(output=True)
Check OpenAI API requirements.
Source code in src/tnh_scholar/utils/validate.py
(lines 53-55)
convert_ms_to_sec(ms)
Convert time from milliseconds (int) to seconds (float).
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 83-85)
convert_sec_to_ms(val)
Convert seconds to milliseconds, rounding to the nearest integer.
Source code in src/tnh_scholar/utils/timing_utils.py
(lines 77-81)
copy_files_with_regex(source_dir, destination_dir, regex_patterns, preserve_structure=True)
Copies files from subdirectories one level down in the source directory to the destination directory if they match any regex pattern. Optionally preserves the directory structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source_dir | Path | Path to the source directory to search files in. | required |
| destination_dir | Path | Path to the destination directory where files will be copied. | required |
| regex_patterns | list[str] | List of regex patterns to match file names. | required |
| preserve_structure | bool | Whether to preserve the directory structure. Defaults to True. | True |
Raises:
| Type | Description |
|---|---|
| ValueError | If the source directory does not exist or is not a directory. |
Example
>>> copy_files_with_regex(
...     source_dir=Path("/path/to/source"),
...     destination_dir=Path("/path/to/destination"),
...     regex_patterns=[r'.*\.txt$', r'.*\.log$'],
...     preserve_structure=True
... )
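The documented behavior (copy files from subdirectories one level down that match any pattern, optionally preserving structure) can be sketched as follows; an assumed reimplementation, not the library source:

```python
import re
import shutil
from pathlib import Path

def copy_files_with_regex(source_dir, destination_dir, regex_patterns,
                          preserve_structure=True):
    """Sketch of the documented contract: scan immediate subdirectories
    of source_dir and copy files whose names match any pattern."""
    source_dir, destination_dir = Path(source_dir), Path(destination_dir)
    if not source_dir.is_dir():
        raise ValueError(f"{source_dir} does not exist or is not a directory")
    compiled = [re.compile(p) for p in regex_patterns]
    for subdir in source_dir.iterdir():
        if not subdir.is_dir():
            continue  # only look one level down
        for f in subdir.iterdir():
            if f.is_file() and any(r.search(f.name) for r in compiled):
                dest = (destination_dir / subdir.name if preserve_structure
                        else destination_dir)
                dest.mkdir(parents=True, exist_ok=True)
                shutil.copy2(f, dest / f.name)
```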
Source code in src/tnh_scholar/utils/file_utils.py
(lines 89-153)
ensure_directory_exists(dir_path)
Create directory if it doesn't exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dir_path | Path | Directory path to ensure exists. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the directory exists or was created successfully, False otherwise. |
Source code in src/tnh_scholar/utils/file_utils.py
(lines 16-31)
ensure_directory_writable(dir_path)
Ensure the directory exists and is writable. Creates the directory if it does not exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dir_path | Path | Directory to verify or create. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the directory cannot be created or is not writable. |
| TypeError | If the provided path is not a Path instance. |
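The documented contract (create if missing, raise ValueError when not creatable or not writable, TypeError for non-Path input) can be sketched like this; an assumed reimplementation, not the actual source:

```python
import os
from pathlib import Path

def ensure_directory_writable(dir_path: Path) -> None:
    """Sketch: create dir_path (with parents) if needed, then verify
    it is writable via os.access."""
    if not isinstance(dir_path, Path):
        raise TypeError("dir_path must be a pathlib.Path")
    try:
        dir_path.mkdir(parents=True, exist_ok=True)
    except OSError as e:
        raise ValueError(f"cannot create {dir_path}: {e}") from e
    if not os.access(dir_path, os.W_OK):
        raise ValueError(f"{dir_path} is not writable")
```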
Source code in src/tnh_scholar/utils/file_utils.py
(lines 33-57)
get_language_code_from_text(text)
Detect the language of the provided text using langdetect.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text to analyze | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | ISO 639-1 code for the detected language. |
Raises:
| Type | Description |
|---|---|
| ValueError | If text is empty or invalid |
Source code in src/tnh_scholar/utils/lang.py
(lines 8-33)
get_language_from_code(code)
Source code in src/tnh_scholar/utils/lang.py
(lines 40-44)
get_language_name_from_text(text)
Source code in src/tnh_scholar/utils/lang.py
(lines 36-37)
get_user_confirmation(prompt, default=True)
Prompt the user for a yes/no confirmation with single-character input (cross-platform). Returns True if 'y' is entered and False if 'n'; pressing Enter returns the default value.
Example usage:
>>> if get_user_confirmation("Do you want to continue"):
...     print("Continuing...")
... else:
...     print("Exiting...")
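The prompt-with-default behavior can be sketched portably with `input()` (the real implementation reads single keypresses, which needs platform-specific code; this sketch trades that for simplicity):

```python
def get_user_confirmation(prompt: str, default: bool = True) -> bool:
    """Sketch: ask a y/n question; Enter accepts the default."""
    suffix = "[Y/n]" if default else "[y/N]"
    while True:
        answer = input(f"{prompt} {suffix}: ").strip().lower()
        if not answer:
            return default
        if answer in ("y", "yes"):
            return True
        if answer in ("n", "no"):
            return False
        # Anything else: re-prompt.
```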
Source code in src/tnh_scholar/utils/user_io_utils.py
(lines 62-90)
iterate_subdir(directory, recursive=False)
Iterates through subdirectories in the given directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| directory | Path | The root directory to start the iteration. | required |
| recursive | bool | If True, iterates recursively through all subdirectories. If False, iterates only over the immediate subdirectories. | False |
Yields:
| Name | Type | Description |
|---|---|---|
| Path | Path | Paths to each subdirectory. |
Example
>>> for subdir in iterate_subdir(Path('/root'), recursive=False):
...     print(subdir)
Source code in src/tnh_scholar/utils/file_utils.py
(lines 59-84)
load_json_into_model(file, model)
Loads a JSON file and validates it against a Pydantic model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file | Path | Path to the JSON file. | required |
| model | type[BaseModel] | The Pydantic model to validate against. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| BaseModel | BaseModel | An instance of the validated Pydantic model. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the file content is invalid JSON or does not match the model. |
Example:
    class ExampleModel(BaseModel):
        name: str
        age: int
        city: str

    if __name__ == "__main__":
        json_file = Path("example.json")
        try:
            data = load_json_into_model(json_file, ExampleModel)
            print(data)
        except ValueError as e:
            print(e)
Source code in src/tnh_scholar/utils/json_utils.py
(lines 109-141)
load_jsonl_to_dict(file_path)
Load a JSONL file into a list of dictionaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the JSONL file. | required |
Returns:
| Type | Description |
|---|---|
| List[Dict] | A list of dictionaries, each representing a line in the JSONL file. |
Example
>>> from pathlib import Path
>>> file_path = Path("data.jsonl")
>>> data = load_jsonl_to_dict(file_path)
>>> print(data)
[{'key1': 'value1'}, {'key2': 'value2'}]
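Reading JSONL amounts to parsing each non-empty line as its own JSON document; a minimal sketch of the documented behavior (not the library's actual source):

```python
import json
from pathlib import Path

def load_jsonl_to_dict(file_path: Path):
    """Sketch: parse each non-empty line of a JSONL file into a dict
    and return the list of dicts in file order."""
    with open(file_path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```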
Source code in src/tnh_scholar/utils/json_utils.py
(lines 88-106)
path_as_str(path)
Source code in src/tnh_scholar/utils/file_utils.py
(lines 243-244)
read_str_from_file(file_path)
Reads the entire content of a text file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | The path to the text file. | required |
Returns:
| Type | Description |
|---|---|
| str | The content of the text file as a single string. |
Source code in src/tnh_scholar/utils/file_utils.py
(lines 156-167)
sanitize_filename(filename, max_length=DEFAULT_MAX_FILENAME_LENGTH)
Sanitize a filename for Unix use.
Source code in src/tnh_scholar/utils/file_utils.py
(lines 194-219)
save_model_to_json(file, model, indent=4, ensure_ascii=False)
Saves a Pydantic model to a JSON file, formatted with indentation for readability.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file | Path | Path to the JSON file where the model will be saved. | required |
| model | BaseModel | The Pydantic model instance to save. | required |
| indent | int | Number of spaces for JSON indentation. Defaults to 4. | 4 |
| ensure_ascii | bool | Whether to escape non-ASCII characters. Defaults to False. | False |
Raises:
| Type | Description |
|---|---|
| ValueError | If the model cannot be serialized to JSON. |
| IOError | If there is an issue writing to the file. |
Example
    class ExampleModel(BaseModel):
        name: str
        age: int

    if __name__ == "__main__":
        model_instance = ExampleModel(name="John", age=30)
        json_file = Path("example.json")
        try:
            save_model_to_json(json_file, model_instance)
            print(f"Model saved to {json_file}")
        except (ValueError, IOError) as e:
            print(e)
Source code in src/tnh_scholar/utils/json_utils.py
(lines 48-85)
to_slug(string)
Slugify a Unicode string.
Converts a string to a strict URL-friendly slug format, allowing only lowercase letters, digits, and hyphens.
Example
>>> to_slug("Héllø_Wörld!")
'hello-world'
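A strict slug (lowercase letters, digits, hyphens) can be sketched with stdlib Unicode normalization. Note this sketch only ASCII-folds decomposable characters (é → e); fully matching the documented 'hello-world' output for letters like ø would need a transliteration library such as text-unidecode, which the real implementation may use:

```python
import re
import unicodedata

def to_slug(string: str) -> str:
    """Sketch: NFKD-fold to ASCII, lowercase, and collapse every run
    of non-[a-z0-9] characters into a single hyphen."""
    ascii_str = (unicodedata.normalize("NFKD", string)
                 .encode("ascii", "ignore").decode("ascii"))
    return re.sub(r"[^a-z0-9]+", "-", ascii_str.lower()).strip("-")
```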
Source code in src/tnh_scholar/utils/file_utils.py
(lines 221-241)
write_str_to_file(file_path, text, overwrite=False)
Writes text to a file with file locking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | PathLike | The path to the file to write. | required |
| text | str | The text to write to the file. | required |
| overwrite | bool | Whether to overwrite the file if it exists. | False |
Raises:
| Type | Description |
|---|---|
| FileExistsError | If the file exists and overwrite is False. |
| OSError | If there's an issue with file locking or writing. |
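The overwrite contract can be sketched as below. This sketch omits the file locking the real implementation adds (locking is platform-specific, e.g. fcntl on POSIX), so it only illustrates the documented FileExistsError behavior:

```python
from pathlib import Path

def write_str_to_file(file_path, text, overwrite=False):
    """Sketch: write text to file_path, refusing to clobber an
    existing file unless overwrite=True."""
    path = Path(file_path)
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} exists; pass overwrite=True")
    path.write_text(text, encoding="utf-8")
```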
Source code in src/tnh_scholar/utils/file_utils.py
(lines 169-192)
file_utils
DEFAULT_MAX_FILENAME_LENGTH = 25
module-attribute
PathLike = Union[str, Path]
module-attribute
__all__ = ['DEFAULT_MAX_FILENAME_LENGTH', 'FileExistsWarning', 'ensure_directory_exists', 'ensure_directory_writable', 'iterate_subdir', 'path_source_str', 'copy_files_with_regex', 'read_str_from_file', 'write_str_to_file', 'sanitize_filename', 'to_slug', 'path_as_str']
module-attribute
FileExistsWarning
Bases: UserWarning
Source code in src/tnh_scholar/utils/file_utils.py
(lines 12-13)
copy_files_with_regex(source_dir, destination_dir, regex_patterns, preserve_structure=True)
Copies files from subdirectories one level down in the source directory to the destination directory if they match any regex pattern. Optionally preserves the directory structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source_dir | Path | Path to the source directory to search files in. | required |
| destination_dir | Path | Path to the destination directory where files will be copied. | required |
| regex_patterns | list[str] | List of regex patterns to match file names. | required |
| preserve_structure | bool | Whether to preserve the directory structure. Defaults to True. | True |
Raises:
| Type | Description |
|---|---|
| ValueError | If the source directory does not exist or is not a directory. |
Example
>>> copy_files_with_regex(
...     source_dir=Path("/path/to/source"),
...     destination_dir=Path("/path/to/destination"),
...     regex_patterns=[r'.*\.txt$', r'.*\.log$'],
...     preserve_structure=True
... )
Source code in src/tnh_scholar/utils/file_utils.py
(lines 89-153)
ensure_directory_exists(dir_path)
Create directory if it doesn't exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
dir_path
|
Path
|
Directory path to ensure exists. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the directory exists or was created successfully, False otherwise. |
Source code in src/tnh_scholar/utils/file_utils.py
(lines 16-31)
ensure_directory_writable(dir_path)
Ensure the directory exists and is writable. Creates the directory if it does not exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dir_path | Path | Directory to verify or create. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the directory cannot be created or is not writable. |
| TypeError | If the provided path is not a Path instance. |
Source code in src/tnh_scholar/utils/file_utils.py
(lines 33-57)
iterate_subdir(directory, recursive=False)
Iterates through subdirectories in the given directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| directory | Path | The root directory to start the iteration. | required |
| recursive | bool | If True, iterates recursively through all subdirectories. If False, iterates only over the immediate subdirectories. | False |
Yields:
| Name | Type | Description |
|---|---|---|
| Path | Path | Paths to each subdirectory. |
Example
>>> for subdir in iterate_subdir(Path('/root'), recursive=False):
...     print(subdir)
Source code in src/tnh_scholar/utils/file_utils.py
(lines 59-84)
path_as_str(path)
Source code in src/tnh_scholar/utils/file_utils.py
(lines 243-244)
path_source_str(path)
Source code in src/tnh_scholar/utils/file_utils.py
(lines 86-87)
read_str_from_file(file_path)
Reads the entire content of a text file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | The path to the text file. | required |
Returns:
| Type | Description |
|---|---|
| str | The content of the text file as a single string. |
Source code in src/tnh_scholar/utils/file_utils.py
(lines 156-167)
sanitize_filename(filename, max_length=DEFAULT_MAX_FILENAME_LENGTH)
Sanitize a filename for Unix use.
Source code in src/tnh_scholar/utils/file_utils.py
(lines 194-219)
to_slug(string)
Slugify a Unicode string.
Converts a string to a strict URL-friendly slug format, allowing only lowercase letters, digits, and hyphens.
Example
>>> to_slug("Héllø_Wörld!")
'hello-world'
Source code in src/tnh_scholar/utils/file_utils.py
(lines 221-241)
write_str_to_file(file_path, text, overwrite=False)
Writes text to a file with file locking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | PathLike | The path to the file to write. | required |
| text | str | The text to write to the file. | required |
| overwrite | bool | Whether to overwrite the file if it exists. | False |
Raises:
| Type | Description |
|---|---|
| FileExistsError | If the file exists and overwrite is False. |
| OSError | If there's an issue with file locking or writing. |
Source code in src/tnh_scholar/utils/file_utils.py
(lines 169-192)
json_utils
format_json(file)
Formats a JSON file with line breaks and indentation for readability.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file | Path | Path to the JSON file to be formatted. | required |

Example:

    >>> format_json(Path("data.json"))
Source code in src/tnh_scholar/utils/json_utils.py
load_json_into_model(file, model)
Loads a JSON file and validates it against a Pydantic model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file | Path | Path to the JSON file. | required |
| model | type[BaseModel] | The Pydantic model to validate against. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| BaseModel | BaseModel | An instance of the validated Pydantic model. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the file content is invalid JSON or does not match the model. |

Example:

    class ExampleModel(BaseModel):
        name: str
        age: int
        city: str

    if __name__ == "__main__":
        json_file = Path("example.json")
        try:
            data = load_json_into_model(json_file, ExampleModel)
            print(data)
        except ValueError as e:
            print(e)
Source code in src/tnh_scholar/utils/json_utils.py
load_jsonl_to_dict(file_path)
Load a JSONL file into a list of dictionaries.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the JSONL file. | required |

Returns:

| Type | Description |
|---|---|
| List[Dict] | A list of dictionaries, each representing a line in the JSONL file. |

Example:

    >>> from pathlib import Path
    >>> file_path = Path("data.jsonl")
    >>> data = load_jsonl_to_dict(file_path)
    >>> print(data)
    [{'key1': 'value1'}, {'key2': 'value2'}]
Source code in src/tnh_scholar/utils/json_utils.py
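The JSONL format described above is simply one JSON object per line. A minimal loader (the name `load_jsonl` here is a hypothetical stand-in for the package's `load_jsonl_to_dict`) can be sketched as:

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

def load_jsonl(file_path: Path) -> List[Dict]:
    """Parse one JSON object per line, skipping blank lines."""
    records = []
    with file_path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.jsonl"
    path.write_text('{"key1": "value1"}\n{"key2": "value2"}\n')
    print(load_jsonl(path))  # [{'key1': 'value1'}, {'key2': 'value2'}]
```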
save_model_to_json(file, model, indent=4, ensure_ascii=False)
Saves a Pydantic model to a JSON file, formatted with indentation for readability.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file | Path | Path to the JSON file where the model will be saved. | required |
| model | BaseModel | The Pydantic model instance to save. | required |
| indent | int | Number of spaces for JSON indentation. Defaults to 4. | 4 |
| ensure_ascii | bool | Whether to escape non-ASCII characters. Defaults to False. | False |

Raises:

| Type | Description |
|---|---|
| ValueError | If the model cannot be serialized to JSON. |
| IOError | If there is an issue writing to the file. |

Example:

    class ExampleModel(BaseModel):
        name: str
        age: int

    if __name__ == "__main__":
        model_instance = ExampleModel(name="John", age=30)
        json_file = Path("example.json")
        try:
            save_model_to_json(json_file, model_instance)
            print(f"Model saved to {json_file}")
        except (ValueError, IOError) as e:
            print(e)
Source code in src/tnh_scholar/utils/json_utils.py
write_data_to_json_file(file, data, indent=4, ensure_ascii=False)
Writes a dictionary or list as a JSON string to a file, ensuring the parent directory exists, and supports formatting with indentation and ASCII control.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file | Path | Path to the JSON file where the data will be written. | required |
| data | Union[dict, list] | The data to write to the file. Typically a dict or list. | required |
| indent | int | Number of spaces for JSON indentation. Defaults to 4. | 4 |
| ensure_ascii | bool | Whether to escape non-ASCII characters. Defaults to False. | False |

Raises:

| Type | Description |
|---|---|
| ValueError | If the data cannot be serialized to JSON. |
| IOError | If there is an issue writing to the file. |

Example:

    >>> from pathlib import Path
    >>> data = {"key": "value"}
    >>> write_data_to_json_file(Path("output.json"), data, indent=2, ensure_ascii=True)
Source code in src/tnh_scholar/utils/json_utils.py
lang
logger = get_child_logger(__name__)
module-attribute
get_language_code_from_text(text)
Detect the language of the provided text using langdetect.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text to analyze | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | ISO 639-1 code for the detected language. |

Raises:

| Type | Description |
|---|---|
| ValueError | If text is empty or invalid |
Source code in src/tnh_scholar/utils/lang.py
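The validate-then-detect flow above can be sketched as follows. This is a hypothetical reimplementation (the name `detect_language_code` is not the package's): input validation happens first, and the optional `langdetect` dependency is imported lazily so the function can be defined without it installed.

```python
def detect_language_code(text: str) -> str:
    """Sketch: validate input, then delegate to langdetect's detect()."""
    if not text or not text.strip():
        raise ValueError("text must be a non-empty string")
    # Deferred import: langdetect is only needed when detection actually runs.
    from langdetect import detect
    return detect(text)  # returns an ISO 639-1 code such as 'en' or 'vi'
```

Note that `langdetect` is non-deterministic by default; callers wanting stable results typically seed its `DetectorFactory`.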
get_language_from_code(code)
Source code in src/tnh_scholar/utils/lang.py
get_language_name_from_text(text)
Source code in src/tnh_scholar/utils/lang.py
progress_utils
BAR_FORMAT = '{desc}: {percentage:3.0f}%|{bar}| Total: {total_fmt} sec. [elapsed: {elapsed}]'
module-attribute
ExpectedTimeTQDM
A context manager for a time-based tqdm progress bar with optional delay.
- 'expected_time': number of seconds we anticipate the task might take.
- 'display_interval': how often (seconds) to refresh the bar.
- 'desc': a short description for the bar.
- 'delay_start': how many seconds to wait (sleep) before we even create/start the bar.
If the task finishes before 'delay_start' has elapsed, the bar may never appear.
Source code in src/tnh_scholar/utils/progress_utils.py
delay_start = delay_start
instance-attribute
desc = desc
instance-attribute
display_interval = display_interval
instance-attribute
expected_time = round(expected_time)
instance-attribute
__enter__()
Source code in src/tnh_scholar/utils/progress_utils.py
__exit__(exc_type, exc_value, traceback)
Source code in src/tnh_scholar/utils/progress_utils.py
__init__(expected_time, display_interval=0.5, desc='Time-based Progress', delay_start=1.0)
Source code in src/tnh_scholar/utils/progress_utils.py
TimeProgress
A context manager for a time-based progress display using dots.
The display updates once per second, printing a dot and showing:

- Expected time (if provided)
- Elapsed time (always displayed)

Example:

    import time
    with TimeProgress(expected_time=60, desc="Transcribing..."):
        time.sleep(5)  # Simulate work
    # [Expected Time: 1:00, Elapsed Time: 0:05] .....
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| expected_time | Optional[float] | Expected time in seconds. Optional. | None |
| display_interval | float | How often to print a dot (seconds). | 1.0 |
| desc | str | Description to display alongside the progress. | '' |
Source code in src/tnh_scholar/utils/progress_utils.py
desc = desc
instance-attribute
display_interval = display_interval
instance-attribute
expected_time = expected_time
instance-attribute
__enter__()
Source code in src/tnh_scholar/utils/progress_utils.py
__exit__(exc_type, exc_value, traceback)
Source code in src/tnh_scholar/utils/progress_utils.py
__init__(expected_time=None, display_interval=1.0, desc='')
Source code in src/tnh_scholar/utils/progress_utils.py
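The context-manager pattern these progress classes use can be sketched without the background refresh thread the real classes run. The class name `SimpleTimeProgress` is hypothetical; the sketch only records elapsed time on exit.

```python
import sys
import time

class SimpleTimeProgress:
    """Minimal time-based progress context manager (no refresh thread)."""

    def __init__(self, expected_time=None, display_interval=1.0, desc=""):
        self.expected_time = expected_time
        self.display_interval = display_interval
        self.desc = desc

    def __enter__(self):
        self._start = time.monotonic()
        if self.desc:
            sys.stdout.write(self.desc + " ")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Record elapsed time and report it; never swallow exceptions.
        self.elapsed = time.monotonic() - self._start
        sys.stdout.write(f"[elapsed: {self.elapsed:.1f}s]\n")
        return False

with SimpleTimeProgress(expected_time=2, desc="Working") as progress:
    time.sleep(0.1)  # simulate work
```

The real `ExpectedTimeTQDM` and `TimeProgress` add a thread that repaints the display every `display_interval` seconds; the enter/exit bookkeeping is the same shape.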
timing_utils
TimeMs
Bases: int
Lightweight representation of a time interval or timestamp in milliseconds. Allows negative values.
Source code in src/tnh_scholar/utils/timing_utils.py
__add__(other)
Source code in src/tnh_scholar/utils/timing_utils.py
__get_pydantic_core_schema__(source_type, handler)
classmethod
Source code in src/tnh_scholar/utils/timing_utils.py
__new__(ms)
Source code in src/tnh_scholar/utils/timing_utils.py
__radd__(other)
Source code in src/tnh_scholar/utils/timing_utils.py
__repr__()
Source code in src/tnh_scholar/utils/timing_utils.py
__rsub__(other)
Source code in src/tnh_scholar/utils/timing_utils.py
__sub__(other)
Source code in src/tnh_scholar/utils/timing_utils.py
from_seconds(seconds)
classmethod
Source code in src/tnh_scholar/utils/timing_utils.py
to_ms()
Source code in src/tnh_scholar/utils/timing_utils.py
to_seconds()
Source code in src/tnh_scholar/utils/timing_utils.py
convert_ms_to_sec(ms)
Convert time from milliseconds (int) to seconds (float).
Source code in src/tnh_scholar/utils/timing_utils.py
convert_sec_to_ms(val)
Convert seconds to milliseconds, rounding to the nearest integer.
Source code in src/tnh_scholar/utils/timing_utils.py
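The `TimeMs` idea above (an `int` subclass whose arithmetic stays typed) can be sketched as follows. This is a simplified reimplementation for illustration; the real class also carries a Pydantic core schema hook, omitted here.

```python
class TimeMs(int):
    """Sketch: int subclass for millisecond intervals; negatives allowed."""

    @classmethod
    def from_seconds(cls, seconds: float) -> "TimeMs":
        return cls(round(seconds * 1000))

    def to_seconds(self) -> float:
        return self / 1000.0

    # Arithmetic re-wraps so results keep the TimeMs type
    # instead of decaying to plain int.
    def __add__(self, other):
        return TimeMs(int(self) + int(other))
    __radd__ = __add__

    def __sub__(self, other):
        return TimeMs(int(self) - int(other))

    def __rsub__(self, other):
        return TimeMs(int(other) - int(self))

start = TimeMs.from_seconds(1.5)
print(start, (start + 500).to_seconds())  # 1500 2.0
```

Subclassing `int` keeps comparisons, hashing, and serialization trivial while the overloads preserve the type through arithmetic.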
tnh_audio_segment
TNHAudioSegment: A typed, minimal wrapper for pydub.AudioSegment.
This class provides a type-safe interface for working with audio segments using pydub, enabling easier composition, slicing, and manipulation of audio data. It exposes common operations such as concatenation, slicing, and length retrieval, while hiding the underlying pydub implementation.
Key features
- Type-annotated methods for static analysis and IDE support
- Static constructors for silent and empty segments
- Operator overloads for concatenation and slicing
- Access to the underlying pydub.AudioSegment via the raw property
Extend this class with additional methods as needed for your audio processing workflows.
TNHAudioSegment
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
raw
property
Access the underlying pydub.AudioSegment if needed.
__add__(other)
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
__getitem__(key)
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
__iadd__(other)
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
__init__(segment)
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
__len__()
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
empty()
staticmethod
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
export(out_f, format, **kwargs)
Wrapper: Export the audio segment to a file-like object or file path.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| out_f | str \| BinaryIO | File path or file-like object to write the audio data to. | required |
| format | str | Audio format (e.g., 'mp3', 'wav'). | required |
| **kwargs | | Additional keyword arguments passed to pydub.AudioSegment.export. | {} |
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
from_file(file, format=None, **kwargs)
staticmethod
Wrapper: Load an audio file into a TNHAudioSegment.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file | str \| Path \| BytesIO | Path to the audio file. | required |
| format | str \| None | Optional audio format (e.g., 'mp3', 'wav'). If None, pydub will attempt to infer it. | None |
| **kwargs | | Additional keyword arguments passed to pydub.AudioSegment.from_file. | {} |

Returns:

| Type | Description |
|---|---|
| TNHAudioSegment | TNHAudioSegment instance containing the loaded audio. |
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
silent(duration)
staticmethod
Source code in src/tnh_scholar/utils/tnh_audio_segment.py
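The design behind `TNHAudioSegment` is a typed facade over a third-party object, with operator overloads that delegate and re-wrap. The sketch below illustrates that pattern with a hypothetical `TypedSegment` class; a plain list stands in for `pydub.AudioSegment` so it runs without audio dependencies (with pydub, slicing a segment selects a millisecond range).

```python
from typing import Any

class TypedSegment:
    """Typed wrapper pattern: delegate to a wrapped object, re-wrap results."""

    def __init__(self, segment: Any) -> None:
        self._segment = segment

    @property
    def raw(self) -> Any:
        # Escape hatch to the underlying implementation object.
        return self._segment

    def __len__(self) -> int:
        return len(self._segment)

    def __add__(self, other: "TypedSegment") -> "TypedSegment":
        # Concatenation delegates, then re-wraps to keep the typed interface.
        return TypedSegment(self._segment + other._segment)

    def __getitem__(self, key) -> "TypedSegment":
        return TypedSegment(self._segment[key])

a = TypedSegment([1, 2, 3])
b = TypedSegment([4, 5])
print(len(a + b), (a + b)[1:3].raw)  # 5 [2, 3]
```

Hiding the third-party type behind a small typed surface gives static analysis and IDE support without committing callers to pydub's full API.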
user_io_utils
get_single_char(prompt=None)
Get a single character from input, adapting to the execution environment.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prompt | Optional[str] | Optional prompt to display before getting input | None |

Returns:

| Type | Description |
|---|---|
| str | A single character string from user input |
Note
- In terminal environments, uses raw input mode without requiring Enter
- In Jupyter/IPython, falls back to regular input and notes that Enter is required
Source code in src/tnh_scholar/utils/user_io_utils.py
get_user_confirmation(prompt, default=True)
Prompt the user for a yes/no confirmation with single-character input (cross-platform). Returns True if 'y' is entered and False if 'n'; pressing Enter selects the default value.

Example usage:

    if get_user_confirmation("Do you want to continue"):
        print("Continuing...")
    else:
        print("Exiting...")
Source code in src/tnh_scholar/utils/user_io_utils.py
validate
OCR_ENV_VARS = {'GOOGLE_APPLICATION_CREDENTIALS'}
module-attribute
OPENAI_ENV_VARS = {'OPENAI_API_KEY'}
module-attribute
logger = get_child_logger(__name__)
module-attribute
check_env(required_vars, feature='this feature', output=True)
Check environment variables and provide user-friendly error messages.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| required_vars | Set[str] | Set of environment variable names to check | required |
| feature | str | Description of feature requiring these variables | 'this feature' |

Returns:

| Name | Type | Description |
|---|---|---|
| bool | bool | True if all required variables are set |
Source code in src/tnh_scholar/utils/validate.py
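The check described above amounts to scanning `os.environ` for each required name and reporting what is missing. A hypothetical sketch (the name `check_env_sketch` is not the package's, and the real function routes messages through its logger rather than `print`):

```python
import os
from typing import Set

def check_env_sketch(required_vars: Set[str], feature: str = "this feature") -> bool:
    """Return True if all required variables are set; print guidance otherwise."""
    missing = [name for name in sorted(required_vars) if not os.environ.get(name)]
    if missing:
        # User-friendly message naming the feature and the unset variables.
        print(f"Missing environment variables for {feature}: {', '.join(missing)}")
        return False
    return True

os.environ["DEMO_KEY"] = "value"  # hypothetical variable for the demo
print(check_env_sketch({"DEMO_KEY"}, feature="demo"))
```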
check_ocr_env(output=True)
Check OCR processing requirements.
Source code in src/tnh_scholar/utils/validate.py
check_openai_env(output=True)
Check OpenAI API requirements.
Source code in src/tnh_scholar/utils/validate.py
get_env_message(missing_vars, feature='this feature')
Generate user-friendly environment setup message.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| missing_vars | List[str] | List of missing environment variable names | required |
| feature | str | Name of feature requiring the variables | 'this feature' |

Returns:

| Type | Description |
|---|---|
| str | Formatted error message with setup instructions |
Source code in src/tnh_scholar/utils/validate.py
version_check
Version checker package for monitoring package version compatibility.
__all__ = ['PackageVersionChecker', 'VersionCheckerConfig', 'VersionStrategy', 'Result', 'PackageInfo']
module-attribute
PackageInfo
dataclass
Information about a package and its versions.
Source code in src/tnh_scholar/utils/version_check/models.py
installed_version = None
class-attribute
instance-attribute
latest_version = None
class-attribute
instance-attribute
name
instance-attribute
required_version = None
class-attribute
instance-attribute
__init__(name, installed_version=None, latest_version=None, required_version=None)
PackageVersionChecker
Main class for checking package versions against requirements.
Source code in src/tnh_scholar/utils/version_check/checker.py
cache = cache or VersionCache()
instance-attribute
provider = provider or StandardVersionProvider()
instance-attribute
__init__(provider=None, cache=None)
Source code in src/tnh_scholar/utils/version_check/checker.py
check_version(package_name, config=None)
Check if package meets version requirements based on config.
Source code in src/tnh_scholar/utils/version_check/checker.py
Result
dataclass
Result of a version check operation.
Source code in src/tnh_scholar/utils/version_check/models.py
diff_details = None
class-attribute
instance-attribute
error = None
class-attribute
instance-attribute
is_compatible
instance-attribute
needs_update
instance-attribute
package_info
instance-attribute
warning_level = None
class-attribute
instance-attribute
__init__(is_compatible, needs_update, package_info, error=None, warning_level=None, diff_details=None)
get_upgrade_command()
Return pip command to upgrade package.
Source code in src/tnh_scholar/utils/version_check/models.py
VersionCheckerConfig
Configuration for version checking behavior.
Source code in src/tnh_scholar/utils/version_check/config.py
cache_duration = cache_duration
instance-attribute
fail_on_error = fail_on_error
instance-attribute
network_timeout = network_timeout
instance-attribute
requirement = requirement
instance-attribute
strategy = strategy
instance-attribute
vdiff_fail_matrix = vdiff_fail_matrix
instance-attribute
vdiff_warn_matrix = vdiff_warn_matrix
instance-attribute
__init__(strategy=VersionStrategy.MINIMUM, requirement='', fail_on_error=False, cache_duration=3600, network_timeout=5, vdiff_warn_matrix=None, vdiff_fail_matrix=None)
Initialize version checker configuration.
Source code in src/tnh_scholar/utils/version_check/config.py
get_required_version()
Get required version as a Version object.
Source code in src/tnh_scholar/utils/version_check/config.py
VersionStrategy
Bases: Enum
Enumeration of version checking strategies.
Source code in src/tnh_scholar/utils/version_check/config.py
EXACT = 'exact'
class-attribute
instance-attribute
LATEST = 'latest'
class-attribute
instance-attribute
MINIMUM = 'minimum'
class-attribute
instance-attribute
RANGE = 'range'
class-attribute
instance-attribute
VERSION_DIFF = 'vdiff'
class-attribute
instance-attribute
cache
Simple caching mechanism for version information.
VersionCache
Simple time-based cache for version information.
Source code in src/tnh_scholar/utils/version_check/cache.py
cache = {}
instance-attribute
cache_duration = cache_duration
instance-attribute
timestamps = {}
instance-attribute
__init__(cache_duration=3600)
Initialize cache with specified expiration time in seconds.
Source code in src/tnh_scholar/utils/version_check/cache.py
get(key)
Get cached version if still valid.
Source code in src/tnh_scholar/utils/version_check/cache.py
is_valid(key)
Check if cached value is still valid.
Source code in src/tnh_scholar/utils/version_check/cache.py
set(key, value)
Cache version with current timestamp.
Source code in src/tnh_scholar/utils/version_check/cache.py
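The cache's documented attributes (`cache`, `timestamps`, `cache_duration`) and methods (`set`, `get`, `is_valid`) suggest the following shape. This is a hypothetical sketch named `SimpleVersionCache`, using `time.monotonic()` for expiry; the real implementation may differ in details.

```python
import time

class SimpleVersionCache:
    """Time-based cache sketch: entries expire after cache_duration seconds."""

    def __init__(self, cache_duration: int = 3600) -> None:
        self.cache_duration = cache_duration
        self.cache = {}
        self.timestamps = {}

    def set(self, key: str, value: str) -> None:
        # Store the value and stamp it with the current time.
        self.cache[key] = value
        self.timestamps[key] = time.monotonic()

    def is_valid(self, key: str) -> bool:
        if key not in self.timestamps:
            return False
        return (time.monotonic() - self.timestamps[key]) < self.cache_duration

    def get(self, key: str):
        # Expired or unknown keys read as a miss.
        return self.cache[key] if self.is_valid(key) else None

cache = SimpleVersionCache(cache_duration=3600)
cache.set("requests", "2.31.0")
print(cache.get("requests"))  # 2.31.0
```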
checker
Main version checker implementation.
PackageVersionChecker
Main class for checking package versions against requirements.
Source code in src/tnh_scholar/utils/version_check/checker.py
cache = cache or VersionCache()
instance-attribute
provider = provider or StandardVersionProvider()
instance-attribute
__init__(provider=None, cache=None)
Source code in src/tnh_scholar/utils/version_check/checker.py
check_version(package_name, config=None)
Check if package meets version requirements based on config.
Source code in src/tnh_scholar/utils/version_check/checker.py
cli
Command-line interface for version checking (stub for future implementation).
main()
Command-line interface for version checking.
Source code in src/tnh_scholar/utils/version_check/cli.py
config
Configuration classes for version checking.
VersionCheckerConfig
Configuration for version checking behavior.
Source code in src/tnh_scholar/utils/version_check/config.py
cache_duration = cache_duration
instance-attribute
fail_on_error = fail_on_error
instance-attribute
network_timeout = network_timeout
instance-attribute
requirement = requirement
instance-attribute
strategy = strategy
instance-attribute
vdiff_fail_matrix = vdiff_fail_matrix
instance-attribute
vdiff_warn_matrix = vdiff_warn_matrix
instance-attribute
__init__(strategy=VersionStrategy.MINIMUM, requirement='', fail_on_error=False, cache_duration=3600, network_timeout=5, vdiff_warn_matrix=None, vdiff_fail_matrix=None)
Initialize version checker configuration.
Source code in src/tnh_scholar/utils/version_check/config.py
get_required_version()
Get required version as a Version object.
Source code in src/tnh_scholar/utils/version_check/config.py
VersionStrategy
Bases: Enum
Enumeration of version checking strategies.
Source code in src/tnh_scholar/utils/version_check/config.py
EXACT = 'exact'
class-attribute
instance-attribute
LATEST = 'latest'
class-attribute
instance-attribute
MINIMUM = 'minimum'
class-attribute
instance-attribute
RANGE = 'range'
class-attribute
instance-attribute
VERSION_DIFF = 'vdiff'
class-attribute
instance-attribute
models
Data models for version checking results.
PackageInfo
dataclass
Information about a package and its versions.
Source code in src/tnh_scholar/utils/version_check/models.py
installed_version = None
class-attribute
instance-attribute
latest_version = None
class-attribute
instance-attribute
name
instance-attribute
required_version = None
class-attribute
instance-attribute
__init__(name, installed_version=None, latest_version=None, required_version=None)
Result
dataclass
Result of a version check operation.
Source code in src/tnh_scholar/utils/version_check/models.py
diff_details = None
class-attribute
instance-attribute
error = None
class-attribute
instance-attribute
is_compatible
instance-attribute
needs_update
instance-attribute
package_info
instance-attribute
warning_level = None
class-attribute
instance-attribute
__init__(is_compatible, needs_update, package_info, error=None, warning_level=None, diff_details=None)
get_upgrade_command()
Return pip command to upgrade package.
Source code in src/tnh_scholar/utils/version_check/models.py
providers
Version provider implementations for retrieving package versions.
StandardVersionProvider
Bases: VersionProvider
Standard implementation of version provider using importlib and PyPI.
Source code in src/tnh_scholar/utils/version_check/providers.py
cache = cache or VersionCache()
instance-attribute
pypi_url_template = 'https://pypi.org/pypi/{package}/json'
instance-attribute
timeout = timeout
instance-attribute
__init__(cache=None, timeout=5)
Source code in src/tnh_scholar/utils/version_check/providers.py
get_installed_version(package_name)
Get installed package version.
Source code in src/tnh_scholar/utils/version_check/providers.py
get_latest_version(package_name)
Get latest available package version from PyPI.
Source code in src/tnh_scholar/utils/version_check/providers.py
VersionProvider
Bases: ABC
Interface for retrieving package version information.
Source code in src/tnh_scholar/utils/version_check/providers.py
get_installed_version(package_name)
abstractmethod
Get installed package version.
Source code in src/tnh_scholar/utils/version_check/providers.py
get_latest_version(package_name)
abstractmethod
Get latest available package version.
Source code in src/tnh_scholar/utils/version_check/providers.py
strategies
Version comparison strategies for package version checking.
check_exact_version(installed, required)
Check if installed version exactly matches requirement.
Source code in src/tnh_scholar/utils/version_check/strategies.py
check_minimum_version(installed, required)
Check if installed version meets minimum requirement.
Source code in src/tnh_scholar/utils/version_check/strategies.py
check_version_diff(installed, reference, vdiff_matrix)
Check if version difference is within specified limits.
Source code in src/tnh_scholar/utils/version_check/strategies.py
parse_vdiff_matrix(matrix_str)
Parse a version difference matrix string.
Source code in src/tnh_scholar/utils/version_check/strategies.py
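The minimum and exact strategies above reduce to version comparisons. The sketch below uses a simplified numeric-tuple parser for illustration (the function names are hypothetical); the real code likely relies on a full parser such as `packaging.version` for pre-releases and other PEP 440 forms.

```python
def parse_version(version: str) -> tuple:
    """Simplified parser: numeric dotted versions only (no pre-release tags)."""
    return tuple(int(part) for part in version.split("."))

def check_minimum(installed: str, required: str) -> bool:
    # MINIMUM strategy: installed must be at least the required version.
    return parse_version(installed) >= parse_version(required)

def check_exact(installed: str, required: str) -> bool:
    # EXACT strategy: versions must match component-for-component.
    return parse_version(installed) == parse_version(required)

print(check_minimum("1.2.0", "1.1.9"))  # True
```

Note one limitation of bare tuples: `"1.2"` parses as `(1, 2)` and compares less than `"1.2.0"`'s `(1, 2, 0)`, which a real version parser normalizes away.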
webhook_server
WebhookServer
A generic webhook server that can receive callbacks from external services.
Source code in src/tnh_scholar/utils/webhook_server.py
app = self._create_flask_app()
instance-attribute
flask_running = Event()
instance-attribute
flask_server_thread = None
instance-attribute
port = port
instance-attribute
tunnel_process = None
instance-attribute
webhook_data = None
instance-attribute
webhook_received = Condition()
instance-attribute
__init__(port=5050)
Initialize webhook server with configuration.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| port | int | The port to run the Flask server on | 5050 |
Source code in src/tnh_scholar/utils/webhook_server.py
cleanup()
Clean up all resources.
Source code in src/tnh_scholar/utils/webhook_server.py
close_tunnel()
Close the tunnel if it's running.
Source code in src/tnh_scholar/utils/webhook_server.py
create_tunnel()
Create a public webhook URL using py-localtunnel.
Returns:

| Type | Description |
|---|---|
| Optional[str] | The public webhook URL or None if tunnel creation failed |
Source code in src/tnh_scholar/utils/webhook_server.py
shutdown_server()
Gracefully shut down the Flask server.
Source code in src/tnh_scholar/utils/webhook_server.py
start_server()
Start Flask server in a separate thread.
Source code in src/tnh_scholar/utils/webhook_server.py
wait_for_webhook(timeout=120)
Wait for webhook data to be received.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| timeout | int | Maximum time to wait in seconds | 120 |

Returns:

| Type | Description |
|---|---|
| Optional[Dict] | The webhook data or None if timed out |
Source code in src/tnh_scholar/utils/webhook_server.py
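The server's documented `webhook_received` Condition and `webhook_data` attribute suggest a classic condition-variable wait. This hypothetical `WebhookWaiter` sketch isolates that synchronization (without Flask or the tunnel): a background delivery notifies the waiter, and `wait_for` returns early or times out.

```python
import threading

class WebhookWaiter:
    """Sketch of wait_for_webhook's synchronization with a Condition."""

    def __init__(self) -> None:
        self.webhook_data = None
        self.webhook_received = threading.Condition()

    def deliver(self, data: dict) -> None:
        # Called from the request-handling thread in the real server.
        with self.webhook_received:
            self.webhook_data = data
            self.webhook_received.notify_all()

    def wait_for_webhook(self, timeout: float = 120):
        with self.webhook_received:
            # Blocks until the predicate holds or the timeout expires.
            self.webhook_received.wait_for(
                lambda: self.webhook_data is not None, timeout=timeout
            )
            return self.webhook_data  # None if the wait timed out

waiter = WebhookWaiter()
threading.Timer(0.05, waiter.deliver, args=({"status": "ok"},)).start()
print(waiter.wait_for_webhook(timeout=5))  # {'status': 'ok'}
```

Using `Condition.wait_for` with a predicate guards against spurious wakeups, which a bare `wait` loop would have to handle manually.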
video_processing
video_processing
video_processing.py
BASE_YDL_OPTIONS = {'quiet': False, 'no_warnings': True, 'extract_flat': True, 'socket_timeout': 30, 'retries': 3, 'ignoreerrors': True, 'logger': logger}
module-attribute
DEFAULT_AUDIO_OPTIONS = BASE_YDL_OPTIONS | {'format': 'bestaudio/best', 'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3', 'preferredquality': '192'}], 'noplaylist': True}
module-attribute
DEFAULT_METADATA_FIELDS = ['id', 'title', 'description', 'duration', 'upload_date', 'uploader', 'channel_url', 'webpage_url', 'original_url', 'channel', 'language', 'categories', 'tags']
module-attribute
DEFAULT_METADATA_OPTIONS = BASE_YDL_OPTIONS | {'skip_download': True}
module-attribute
DEFAULT_TRANSCRIPT_OPTIONS = BASE_YDL_OPTIONS | {'skip_download': True, 'writesubtitles': True, 'writeautomaticsub': True, 'subtitlesformat': 'ttml'}
module-attribute
DEFAULT_VIDEO_OPTIONS = BASE_YDL_OPTIONS | {'format': 'bestvideo+bestaudio/best', 'merge_output_format': 'mp4', 'noplaylist': True}
module-attribute
TEMP_FILENAME_FORMAT = 'temp_%(id)s'
module-attribute
TEMP_FILENAME_STR = 'temp_{id}'
module-attribute
logger = get_child_logger(__name__)
module-attribute
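The option dictionaries above are composed with the dict-union operator (`|`, Python 3.9+): each specialized option set layers overrides on `BASE_YDL_OPTIONS`, with later keys winning. A runnable demonstration using plain dicts (the base here is a stand-in; the real `BASE_YDL_OPTIONS` also carries a logger object):

```python
# Stand-in for BASE_YDL_OPTIONS (minus the logger, for a dependency-free demo).
base_options = {
    "quiet": False,
    "no_warnings": True,
    "extract_flat": True,
    "socket_timeout": 30,
    "retries": 3,
    "ignoreerrors": True,
}

# Metadata-only fetches layer skip_download on top of the base, as in
# DEFAULT_METADATA_OPTIONS = BASE_YDL_OPTIONS | {'skip_download': True}.
metadata_options = base_options | {"skip_download": True}

# Later keys win, so a second union can further override earlier values.
patched = metadata_options | {"retries": 5}

print(metadata_options["skip_download"], patched["retries"])  # True 5
```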
DLPDownloader
Bases: YTDownloader
yt-dlp based implementation of YouTube content retrieval.
Ensures temporary file exports use the TEMP_FILENAME_FORMAT pattern ('temp_%(id)s'), then renames the export file based on title and ID by default, or moves it to the specified output file with the appropriate extension.
Source code in src/tnh_scholar/video_processing/video_processing.py
config = config or BASE_YDL_OPTIONS
instance-attribute
__init__(config=None)
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 178-179
get_audio(url, start=None, end=None, output_path=None)
Download audio and get metadata for a YouTube video.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 232-255
get_default_export_name(url)
Get default export filename for a URL.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 342-345
get_default_filename_stem(metadata)
Generate the object download filename.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 333-340
get_metadata(url)
Get metadata for a YouTube video.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 181-193
get_transcript(url, lang='en', output_path=None)
Downloads video transcript in TTML format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | str | YouTube video URL | required |
| lang | str | Language code for transcript | 'en' |
| output_path | Optional[Path] | Optional output directory (uses current dir if None) | None |

Returns:

| Type | Description |
|---|---|
| VideoTranscript | VideoTranscript containing TTML file path and metadata |

Raises:

| Type | Description |
|---|---|
| TranscriptError | If no transcript is found for the specified language |

Source code in src/tnh_scholar/video_processing/video_processing.py, lines 195-230
get_video(url, quality=None, output_path=None)
Download the full video with associated metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | str | YouTube video URL | required |
| quality | Optional[str] | yt-dlp format string (default: highest available) | None |
| output_path | Optional[Path] | Optional output directory | None |

Returns:

| Type | Description |
|---|---|
| VideoFile | VideoFile containing video file path and metadata |

Raises:

| Type | Description |
|---|---|
| VideoDownloadError | If download fails |

Source code in src/tnh_scholar/video_processing/video_processing.py, lines 257-293
DownloadError
Bases: VideoProcessingError
Raised for download-related errors.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 89-91
TranscriptError
Bases: VideoProcessingError
Raised for transcript-related errors.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 85-87
VideoAudio
dataclass
Bases: VideoResource
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 106-107
VideoDownloadError
Bases: VideoProcessingError
Raised for video download-related errors.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 93-95
VideoFile
dataclass
Bases: VideoResource
Represents a downloaded video file and its metadata.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 109-111
VideoProcessingError
Bases: Exception
Base exception for video processing errors.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 81-83
VideoResource
dataclass
Base class for all video resources.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 97-101
filepath = None
class-attribute
instance-attribute
metadata
instance-attribute
__init__(metadata, filepath=None)
VideoTranscript
dataclass
Bases: VideoResource
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 103-104
YTDownloader
Abstract base class for YouTube content retrieval.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 114-163
get_audio(url, start, end, output_path)
Extract audio with associated metadata.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 126-134
get_metadata(url)
Retrieve video metadata only.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 136-141
get_transcript(url, lang='en', output_path=None)
Retrieve video transcript with associated metadata.
Source code in src/tnh_scholar/video_processing/video_processing.py, lines 117-124
get_video(url, quality=None, output_path=None)
Download the full video with associated metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | str | YouTube video URL | required |
| quality | Optional[str] | yt-dlp format string (default: highest available) | None |
| output_path | Optional[Path] | Optional output directory | None |

Returns:

| Type | Description |
|---|---|
| VideoFile | VideoFile containing video file path and metadata |

Raises:

| Type | Description |
|---|---|
| VideoDownloadError | If download fails |

Source code in src/tnh_scholar/video_processing/video_processing.py, lines 143-163
extract_text_from_ttml(ttml_path)
Extract plain text content from TTML file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| ttml_path | Path | Path to TTML transcript file | required |

Returns:

| Type | Description |
|---|---|
| str | Plain text content with one sentence per line |

Raises:

| Type | Description |
|---|---|
| ValueError | If file doesn't exist or has invalid content |

Source code in src/tnh_scholar/video_processing/video_processing.py, lines 389-428
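A hedged sketch of the TTML-to-text step (independent of the real extract_text_from_ttml, which works from a file path): TTML cues are namespaced `p` elements, so extraction amounts to walking those elements and collecting their text, one line per cue.

```python
# Minimal TTML text extraction sketch; function name and string-based
# input are illustrative, not the package's actual API.
import xml.etree.ElementTree as ET

TTML_NS = "{http://www.w3.org/ns/ttml}"

def extract_ttml_text(ttml: str) -> str:
    """Return one line of text per <p> cue in a TTML document."""
    root = ET.fromstring(ttml)
    lines = []
    for p in root.iter(f"{TTML_NS}p"):
        text = "".join(p.itertext()).strip()
        if text:
            lines.append(text)
    return "\n".join(lines)

sample = (
    '<tt xmlns="http://www.w3.org/ns/ttml"><body><div>'
    '<p begin="0s" end="2s">Breathing in,</p>'
    '<p begin="2s" end="4s">I calm my body.</p>'
    "</div></body></tt>"
)
```

Note the namespace prefix on the tag lookup: without it, `iter("p")` finds nothing in a default-namespaced TTML document.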
get_youtube_urls_from_csv(file_path)
Reads a CSV file containing YouTube URLs and titles, logs the titles, and returns a list of URLs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the CSV file containing YouTube URLs and titles. | required |

Returns:

| Type | Description |
|---|---|
| List[str] | List of YouTube URLs. |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If the file does not exist. |
| ValueError | If the CSV file is improperly formatted. |

Source code in src/tnh_scholar/video_processing/video_processing.py, lines 430-471
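The contract above can be sketched in a few lines. This is an illustrative stand-in, not the package's implementation: the "url"/"title" column names are assumptions, and it reads from a string rather than a Path for testability.

```python
# Sketch of reading YouTube URLs from a CSV of (url, title) rows.
# Column names are assumed; the real function works on a Path.
import csv
import io

def read_youtube_urls(csv_text: str) -> list[str]:
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "url" not in reader.fieldnames:
        # Mirrors the documented ValueError for a malformed CSV.
        raise ValueError("CSV file is improperly formatted: missing 'url' column")
    return [row["url"] for row in reader if row.get("url")]

sample_csv = (
    "url,title\n"
    "https://youtu.be/abc123,Dharma Talk 1\n"
    "https://youtu.be/def456,Dharma Talk 2\n"
)
```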
video_processing_old1
DEFAULT_TRANSCRIPT_DIR = Path.home() / '.yt_dlp_transcripts'
module-attribute
DEFAULT_TRANSCRIPT_OPTIONS = {'skip_download': True, 'quiet': True, 'no_warnings': True, 'extract_flat': True, 'socket_timeout': 30, 'retries': 3, 'ignoreerrors': True, 'logger': logger}
module-attribute
logger = get_child_logger(__name__)
module-attribute
SubtitleTrack
Bases: TypedDict
Type definition for a subtitle track entry.
Source code in src/tnh_scholar/video_processing/video_processing_old1.py, lines 55-60
ext
instance-attribute
name
instance-attribute
url
instance-attribute
TranscriptNotFoundError
Bases: Exception
Raised when no transcript is available for the requested language.
Source code in src/tnh_scholar/video_processing/video_processing_old1.py, lines 29-52
language = language
instance-attribute
video_url = video_url
instance-attribute
__init__(video_url, language)
Initialize TranscriptNotFoundError.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| video_url | str | URL of the video where transcript was not found | required |
| language | str | Language code that was requested | required |
| available_manual | | List of available manual transcript languages | required |
| available_auto | | List of available auto-generated transcript languages | required |

Source code in src/tnh_scholar/video_processing/video_processing_old1.py, lines 32-52
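The pattern above, an exception that records which video and language failed, can be sketched minimally. Class name and message format here are illustrative; the real TranscriptNotFoundError also carries the available manual/auto language lists.

```python
# Simplified sketch of an exception carrying context attributes,
# mirroring TranscriptNotFoundError's video_url/language fields.
class TranscriptNotFound(Exception):
    def __init__(self, video_url: str, language: str):
        self.video_url = video_url
        self.language = language
        # Attributes survive on the instance; the message is for logs.
        super().__init__(f"No '{language}' transcript for {video_url}")

err = TranscriptNotFound("https://youtu.be/abc123", "en")
```

Storing the failing inputs on the exception lets callers decide whether to retry with another language without parsing the message string.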
VideoInfo
Bases: TypedDict
Type definition for relevant video info fields.
Source code in src/tnh_scholar/video_processing/video_processing_old1.py, lines 63-67
automatic_captions
instance-attribute
subtitles
instance-attribute
download_audio_yt(url, output_dir, start_time=None, prompt_overwrite=True)
Downloads audio from a YouTube video using yt_dlp.YoutubeDL, with an optional start time.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | str | URL of the YouTube video. | required |
| output_dir | Path | Directory to save the downloaded audio file. | required |
| start_time | str | Optional start time (e.g., '00:01:30' for 1 minute 30 seconds). | None |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | Path | Path to the downloaded audio file. |

Source code in src/tnh_scholar/video_processing/video_processing_old1.py, lines 134-171
get_transcript(url, lang='en', download_dir=DEFAULT_TRANSCRIPT_DIR, keep_transcript_file=False)
Downloads and extracts the transcript for a given YouTube video URL.
Retrieves the transcript file, extracts the text content, and returns the raw text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The URL of the YouTube video. | required |
| lang | str | The language code for the transcript. | 'en' |
| download_dir | Path | The directory to download the transcript to. | DEFAULT_TRANSCRIPT_DIR |
| keep_transcript_file | bool | Whether to keep the downloaded transcript file. | False |

Returns:

| Type | Description |
|---|---|
| str | The extracted transcript text. |

Raises:

| Type | Description |
|---|---|
| TranscriptNotFoundError | If no transcript is available in the specified language. |
| DownloadError | If video info extraction or download fails. |
| ValueError | If the downloaded transcript file is invalid or empty. |
| ParseError | If XML parsing of the transcript fails. |

Source code in src/tnh_scholar/video_processing/video_processing_old1.py, lines 174-213
get_transcript_info(video_url, lang='en')
Retrieves the transcript URL for a video in the specified language.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| video_url | str | The URL of the video | required |
| lang | str | The desired language code | 'en' |

Returns:

| Type | Description |
|---|---|
| | URL of the transcript |

Raises:

| Type | Description |
|---|---|
| TranscriptNotFoundError | If no transcript is available in the specified language |
| DownloadError | If video info extraction fails |

Source code in src/tnh_scholar/video_processing/video_processing_old1.py, lines 318-358
get_video_download_path_yt(output_dir, url)
Extracts the video title using yt-dlp.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The YouTube URL. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | Path | The title of the video. |

Source code in src/tnh_scholar/video_processing/video_processing_old1.py, lines 110-132
get_youtube_urls_from_csv(file_path)
Reads a CSV file containing YouTube URLs and titles, logs the titles, and returns a list of URLs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the CSV file containing YouTube URLs and titles. | required |

Returns:

| Type | Description |
|---|---|
| List[str] | List of YouTube URLs. |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If the file does not exist. |
| ValueError | If the CSV file is improperly formatted. |

Source code in src/tnh_scholar/video_processing/video_processing_old1.py, lines 70-108
video_processing_old2
AUDIO_DOWNLOAD_OPTIONS = BASE_YDL_OPTIONS | {'format': 'bestaudio/best', 'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3', 'preferredquality': '192'}], 'noplaylist': True}
module-attribute
BASE_YDL_OPTIONS = {'quiet': True, 'no_warnings': True, 'extract_flat': True, 'socket_timeout': 30, 'retries': 3, 'ignoreerrors': True, 'logger': logger}
module-attribute
DEFAULT_METADATA_FIELDS = ['id', 'title', 'description', 'duration', 'upload_date', 'uploader', 'channel_url', 'webpage_url', 'original_url', 'channel', 'language', 'categories', 'tags']
module-attribute
DEFAULT_TRANSCRIPT_DIR = Path.home() / '.yt_dlp_transcripts'
module-attribute
TRANSCRIPT_OPTIONS = BASE_YDL_OPTIONS | {'writesubtitles': True, 'writeautomaticsub': True, 'subtitlesformat': 'ttml'}
module-attribute
logger = get_child_logger(__name__)
module-attribute
SubtitleTrack
Bases: TypedDict
Type definition for a subtitle track entry.
Source code in src/tnh_scholar/video_processing/video_processing_old2.py, lines 81-85
ext
instance-attribute
name
instance-attribute
url
instance-attribute
TranscriptNotFoundError
Bases: Exception
Raised when no transcript is available for the requested language.
Source code in src/tnh_scholar/video_processing/video_processing_old2.py, lines 92-99
language = language
instance-attribute
video_url = video_url
instance-attribute
__init__(video_url, language)
Source code in src/tnh_scholar/video_processing/video_processing_old2.py, lines 94-99
VideoDownload
dataclass
Bases: VideoMetadata
Result of download operations.
Source code in src/tnh_scholar/video_processing/video_processing_old2.py, lines 76-79
filepath
instance-attribute
__init__(metadata, filepath)
VideoInfo
Bases: TypedDict
Type definition for relevant video info fields.
Source code in src/tnh_scholar/video_processing/video_processing_old2.py, lines 87-90
automatic_captions
instance-attribute
subtitles
instance-attribute
VideoMetadata
dataclass
Base class for video operations containing common metadata.
Source code in src/tnh_scholar/video_processing/video_processing_old2.py, lines 66-69
metadata
instance-attribute
__init__(metadata)
VideoTranscript
dataclass
Bases: VideoMetadata
Result of transcript operations.
Source code in src/tnh_scholar/video_processing/video_processing_old2.py, lines 71-74
content
instance-attribute
__init__(metadata, content)
download_audio_yt(url, output_dir, start_time=None)
Downloads audio from YouTube URL with optional start time.
Source code in src/tnh_scholar/video_processing/video_processing_old2.py, lines 142-165
get_transcript(url, lang='en', download_dir=DEFAULT_TRANSCRIPT_DIR, keep_transcript_file=False)
Downloads and extracts transcript with metadata.
Source code in src/tnh_scholar/video_processing/video_processing_old2.py, lines 167-196
get_video_download_path_yt(output_dir, url)
Get video metadata and expected download path.
Source code in src/tnh_scholar/video_processing/video_processing_old2.py, lines 126-140
get_video_metadata(url)
Get metadata for a YouTube video without downloading content.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | str | YouTube video URL | required |

Returns:

| Type | Description |
|---|---|
| VideoResult | VideoResult with only metadata field populated |

Raises:

| Type | Description |
|---|---|
| DownloadError | If video info extraction fails |

Source code in src/tnh_scholar/video_processing/video_processing_old2.py, lines 275-292
get_youtube_urls_from_csv(file_path)
Reads YouTube URLs from a CSV file containing URLs and titles.
Source code in src/tnh_scholar/video_processing/video_processing_old2.py, lines 101-124
yt_transcribe
DEFAULT_CHUNK_DURATION_MS = 10 * 60 * 1000
module-attribute
DEFAULT_CHUNK_DURATION_S = 10 * 60
module-attribute
DEFAULT_OUTPUT_DIR = './video_transcriptions'
module-attribute
DEFAULT_PROMPT = 'Dharma, Deer Park, Thay, Thich Nhat Hanh, Bodhicitta, Bodhisattva, Mahayana'
module-attribute
EXPECTED_ENV = 'tnh-scholar'
module-attribute
args = parser.parse_args()
module-attribute
group = parser.add_mutually_exclusive_group(required=True)
module-attribute
logger = get_child_logger('yt_transcribe')
module-attribute
output_directory = Path(args.output_dir)
module-attribute
parser = argparse.ArgumentParser(description='Transcribe YouTube videos from a URL or a file containing URLs.')
module-attribute
url_file = Path(args.file)
module-attribute
video_urls = []
module-attribute
check_conda_env()
Source code in src/tnh_scholar/video_processing/yt_transcribe.py, lines 31-39
transcribe_youtube_videos(urls, output_base_dir, max_chunk_duration=DEFAULT_CHUNK_DURATION_S, start=None, translate=False)
Full pipeline for transcribing a list of YouTube videos.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| urls | list[str] | List of YouTube video URLs. | required |
| output_base_dir | Path | Base directory for storing output. | required |
| max_chunk_duration | int | Maximum duration for audio chunks in seconds (default is 10 minutes). | DEFAULT_CHUNK_DURATION_S |

Source code in src/tnh_scholar/video_processing/yt_transcribe.py, lines 46-125
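The max_chunk_duration parameter implies splitting a recording into fixed-size windows before transcription. A sketch of that boundary arithmetic, using the module's DEFAULT_CHUNK_DURATION_S (the helper function itself is illustrative, not part of the package):

```python
# Sketch: compute (start, end) second offsets for audio chunking
# under a maximum chunk duration, defaulting to 10 minutes.
DEFAULT_CHUNK_DURATION_S = 10 * 60

def chunk_bounds(total_s: int, max_chunk_s: int = DEFAULT_CHUNK_DURATION_S):
    """Return (start, end) second offsets covering [0, total_s)."""
    return [
        (start, min(start + max_chunk_s, total_s))
        for start in range(0, total_s, max_chunk_s)
    ]
```

The final chunk is simply shorter when the total duration is not a multiple of the chunk size; no padding is needed for transcription.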
xml_processing
FormattingError
Bases: Exception
Custom exception raised for formatting-related errors.
Source code in src/tnh_scholar/xml_processing/xml_processing.py, lines 7-13
__init__(message='An error occurred due to invalid formatting.')
Source code in src/tnh_scholar/xml_processing/xml_processing.py, lines 12-13
PagebreakXMLParser
Parses XML documents split by pagebreak tags.
Source code in src/tnh_scholar/xml_processing/xml_processing.py, lines 141-214
cleaned_text = ''
instance-attribute
original_text = text
instance-attribute
pagebreak_tags = []
instance-attribute
pages = []
instance-attribute
__init__(text)
Source code in src/tnh_scholar/xml_processing/xml_processing.py, lines 146-156
parse(page_groups=None, keep_pagebreaks=True)
Parses the XML and returns a list of page contents, optionally grouped and with pagebreaks retained.
Source code in src/tnh_scholar/xml_processing/xml_processing.py, lines 199-214
join_xml_data_to_doc(file_path, data, overwrite=False)
Joins a list of XML-tagged data with newlines, wraps it in a document-level root tag, and writes the result to the given file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the output file. | required |
| data | List[str] | List of XML-tagged data strings. | required |
| overwrite | bool | Whether to overwrite the file if it exists. | False |

Raises:

| Type | Description |
|---|---|
| FileExistsError | If the file exists and overwrite is False. |
| ValueError | If the data list is empty. |

Example
join_xml_data_to_doc(Path("output.xml"), ["<data>Data</data>"], overwrite=True)
Source code in src/tnh_scholar/xml_processing/xml_processing.py, lines 88-121
remove_page_tags(text)
Removes page tags from the input text.
Parameters:
- text (str): The input text containing page tags.
Returns:
- str: The text with page tags removed.
Source code in src/tnh_scholar/xml_processing/xml_processing.py, lines 124-138
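Stripping page markers is essentially a regex substitution plus whitespace cleanup. A minimal sketch, assuming a `<page ...>`/`</page>` tag shape (the actual tag name and attributes in the package may differ):

```python
# Sketch of remove_page_tags-style behavior; the "page" tag name is
# an assumption based on the surrounding API, not the real markup.
import re

def strip_page_tags(text: str) -> str:
    # Drop <page ...> and </page> markers, then tidy leftover blank lines.
    stripped = re.sub(r"</?page[^>]*>", "", text)
    return re.sub(r"\n{3,}", "\n\n", stripped).strip()

doc = (
    "<page page='1'>\nFirst page text.\n</page>\n"
    "<page page='2'>\nSecond page text.\n</page>"
)
```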
save_pages_to_xml(output_xml_path, text_pages, overwrite=False)
Generates and saves an XML file containing text pages, with a page tag marking each page.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output_xml_path | Path | The Path object for the file where the XML file will be saved. | required |
| text_pages | List[str] | A list of strings, each representing the text content of a page. | required |
| overwrite | bool | If True, overwrites the file if it exists. Default is False. | False |

Returns:

| Type | Description |
|---|---|
| None | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If the input list of text_pages is empty or contains invalid types. |
| FileExistsError | If the file already exists and overwrite is False. |
| PermissionError | If the file cannot be created due to insufficient permissions. |
| OSError | For other file I/O-related errors. |

Source code in src/tnh_scholar/xml_processing/xml_processing.py, lines 16-85
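The shape of this function can be sketched as: wrap each page in a numbered page tag, join under a document root, and refuse to clobber an existing file unless asked. Tag names (`document`, `page`) and the helper names are assumptions for illustration, not the package's actual markup.

```python
# Sketch of save_pages_to_xml's contract; tag names are assumptions.
from pathlib import Path
import tempfile

def pages_to_xml(pages: list[str]) -> str:
    if not pages:
        raise ValueError("text_pages must not be empty")
    body = "\n".join(
        f'<page page="{i}">\n{page}\n</page>' for i, page in enumerate(pages, 1)
    )
    return f"<document>\n{body}\n</document>"

def save_pages(path: Path, pages: list[str], overwrite: bool = False) -> None:
    # Mirror the documented FileExistsError when overwrite is False.
    if path.exists() and not overwrite:
        raise FileExistsError(path)
    path.write_text(pages_to_xml(pages), encoding="utf-8")

out = Path(tempfile.mkdtemp()) / "pages.xml"
save_pages(out, ["First page.", "Second page."])
```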
split_xml_on_pagebreaks(text, page_groups=None, keep_pagebreaks=True)
Splits an XML document into individual pages based on pagebreak tags.
Source code in src/tnh_scholar/xml_processing/xml_processing.py, lines 217-228
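Splitting on pagebreak markers reduces to a regex split. A minimal sketch, assuming a self-closing `<pagebreak/>` marker (the real tag form in the package may differ), including the keep_pagebreaks option documented above:

```python
# Sketch of pagebreak-based splitting; the <pagebreak/> marker shape
# is an assumption for illustration.
import re

def split_on_pagebreaks(text: str, keep_pagebreaks: bool = False) -> list[str]:
    parts = re.split(r"<pagebreak[^>]*/>", text)
    pages = [p.strip() for p in parts if p.strip()]
    if keep_pagebreaks:
        # Re-append the marker so downstream joins can round-trip.
        pages = [f"{page}\n<pagebreak/>" for page in pages]
    return pages

doc = "Page one text.\n<pagebreak/>\nPage two text.\n<pagebreak/>\nPage three text."
```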
split_xml_pages(text)
Backwards-compatible helper that returns the page contents without pagebreak tags.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | XML document string. | required |

Returns:

| Type | Description |
|---|---|
| List[str] | List of page strings. |

Source code in src/tnh_scholar/xml_processing/xml_processing.py, lines 231-241
extract_tags
extract_unique_tags(xml_file)
Extract all unique tags from an XML file using lxml.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| xml_file | str | Path to the XML file. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| set | | A set of unique tags in the XML document. |

Source code in src/tnh_scholar/xml_processing/extract_tags.py, lines 6-20
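The same idea can be sketched with the standard library parser (the real extract_unique_tags uses lxml, per the docstring above, and reads from a file path rather than a string):

```python
# Stdlib sketch of collecting every unique element tag in a document.
import xml.etree.ElementTree as ET

def unique_tags(xml_text: str) -> set[str]:
    root = ET.fromstring(xml_text)
    # root.iter() yields the root element plus all descendants.
    return {element.tag for element in root.iter()}

sample = "<document><page><title>One</title></page><page/></document>"
```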
main()
Source code in src/tnh_scholar/xml_processing/extract_tags.py, lines 23-39